Opened 13 years ago

Last modified 12 years ago

#3956 enhancement new

Add arraysize option to runQuery in adbapi

Reported by: Gerrat Owned by:
Priority: normal Milestone:
Component: core Keywords: adbapi arraysize
Cc: Jean-Paul Calderone, jesstess, David Reid, TimAllen, amaury Branch: branches/adbapi-arraysize-3956
branch-diff, diff-cov, branch-cov, buildbot
Author: Gerrat Rickert, Screwtape, exarkun

Description

The convenience method, runQuery in adbapi doesn't currently support executing a cursor with anything other than the default cursor arraysize. The arraysize attribute controls how many rows are fetched at a time from the database (during a fetchall() or fetchmany() operation). The default cursor arraysize is 1 (according to PEP 249), and using this arraysize results in terrible performance when retrieving a large number of rows from the database.

Attachments (5)

adbapi-arraysize-patch.patch (4.0 KB) - added by Gerrat 13 years ago.
adbapi-arraysize-patch.2.patch (4.4 KB) - added by Gerrat 13 years ago.
adbapi-arraysize-patch3.patch (5.0 KB) - added by Gerrat 13 years ago.
adbapi-arraysize-in-ConnectionPool.diff (6.9 KB) - added by Screwtape 12 years ago.
Moves the cp_arraysize argument to ConnectionPool.init
use-positive-assertions.patch (507 bytes) - added by Screwtape 12 years ago.
New code should use positive assertions, not double-negative assertions.

Download all attachments as: .zip

Change History (44)

comment:1 Changed 13 years ago by Jean-Paul Calderone

Cc: Jean-Paul Calderone added
Keywords: review removed
Owner: changed from Glyph to Gerrat

Hi Gerrat. Thanks for the patch! Here's a bit of feedback:

  1. Twisted supports Python 2.3. This means inlineCallbacks cannot actually be used in Twisted code.
  2. There should be API documentation about the new arraysize parameter which runQuery now accepts.
  3. assertEquals is preferred over failUnless(a == b)

It would be neat if this test could be run independent of any particular real DB-API 2.0 module. The implementation doesn't really tightly depend on any of them; it could be run against a mock module. Then it could run regardless of the real DB-API 2.0 modules being installed or not. It would also make it faster and more reliable. It also might make the test a bit cleaner; with a mock cursor implementation, it's easier to inspect the arraysize attribute of the cursor; the transaction doesn't need to be fiddled with at all, so the test has fewer implementation assumptions encoded in it.

comment:2 Changed 13 years ago by Gerrat

Keywords: review added

Ok, I've taken care of 1-3, separated the test case into it's own class, and added a mock module (or, at least what I think you're asking for). I wasn't really sure how the mock module should track the cursor's arraysize. Since it's not actually running any sql; I found it convenient to just return the cursor's arraysize in the fetchall() method. ...that may not go over so well, so I'm happy for alternate suggestions.

comment:3 Changed 13 years ago by Gerrat

...well, I thought better of having fetchall return the array size (and tweaked the mock module).

comment:4 Changed 13 years ago by Jean-Paul Calderone

Owner: Gerrat deleted

comment:5 Changed 13 years ago by Jean-Paul Calderone

Keywords: review removed
Owner: set to Gerrat

Looking good. Here's some more feedback:

  1. Some names in the new code don't follow Twisted's naming convention. Instead of foo_bar for functions and variables, please use fooBar (but leave test_arraysize - that is correct!)
  2. test_arraysize needs a docstring. It should probably have a docstring along the lines of ArraySizeTestCase's docstring. Writing a class docstring when there's only one method can be tough. :) The class docstring should probably say something about how the test case is run against a mock dbapi module to verify non-db-level behavior.
  3. call_runQuery could be split into a few statements to make it a bit easier to read. :)
  4. You don't need d in test_arraysize. You can just return that DeferredList.
  5. The change to _runQuery to check for 'arraysize' in kw should use the in operator instead of dict.has_key. _runQuery should also have a docstring that mentions the behavior it has for the arraysize parameter (point 2 from the last review).
  6. The mock module looks good. It could benefit from some docstrings. Also, methods should be separated by two blank lines from each other. Classes by three blank lines.

Thanks again!

Changed 13 years ago by Gerrat

Changed 13 years ago by Gerrat

comment:6 Changed 13 years ago by Gerrat

Keywords: review added
Owner: changed from Gerrat to Jean-Paul Calderone

...well I guess if I forget to tick the checkbox to replace an attached file, a new one gets added instead...and I see no way of deleting either. Let's go with the .2. patch.

  1. ...and what kind of language promotes "Readability counts." ...oh never mind :)
  1. I've added the arraysize parameter to the documentation for runQuery (since that's the actual method that a user would call). I *think* it should just live there, but if you think it should be in _runQuery or that an additional docstring should be added to _runQuery I can certainly do that. (...as an aside, I *really* did add this once for point 2 in the original review...then I grabbed a new copy of the source and overwrote it).

I think I've taken care of 1-6.

comment:7 Changed 13 years ago by Jean-Paul Calderone

Owner: Jean-Paul Calderone deleted

Anyone can re-review this, and I might not get a chance for a while.

comment:8 Changed 13 years ago by jesstess

Cc: jesstess added

The newest patch applies cleanly, the test cases pass, the doc strings don't raise errors in pydoctor, and the generated API docs look fine.

Patch 2 doesn't have any differences from patch 1 to address points 3 and 4 made by exarkun. I assume they have improvements on an overwritten patch I can't see.

Points 1, 2, 5, and 6 by exarkun are addressed.

_runQuery is undocumented, but the rest of the _methods for this class are too. "Documentation is always nice, though".

Otherwise, looks good.

Changed 13 years ago by Gerrat

comment:9 Changed 13 years ago by Gerrat

You are correct - 3 & 4 have improvements from an overwritten patch. For point 3, "_callRunQuery" was one statement - now it's two. I suppose I could stretch it to three if anyone feels that would be easier to read. For point 4, originally there was an extra deferred involved, that fired on a callback from the deferredList; and it was this extra deferred that the method returned. This was redundant, and the deferredList is just returned now.

I wasn't sure what the standard was w.r.t. overwriting previously submitted patches vs. adding an additional patch - I was just overwriting. Perhaps I should have kept a running list of improvements to make the review process easier.

I've added docstrings to _VerifyArraySize and _callRunQuery (to be "nice"), as patch3.

comment:10 Changed 13 years ago by David Reid

Cc: David Reid added
Keywords: review removed

Hi Gerrat,

Thanks for working on this. A few nits.

  1. I don't think copyright notices shouldn't have dates removed from them. So the existing files should be 2001-2009 not just 2009.
  2. There isn't really a good reason for the sizes dict to be sizes instead of just _sizes.
  3. If conn is supposed to be a public attribute of Cursor it should be documented in the doc string if it is not it should just be _conn. It also doesn't look like it's ever used so maybe it shouldn't exist.

comment:11 Changed 13 years ago by Jean-Paul Calderone

Author: Gerrat RickertGerrat Rickert, exarkun
Branch: branches/adbapi-arraysize-3956

(In [27556]) Branching to 'adbapi-arraysize-3956'

comment:12 Changed 13 years ago by Jean-Paul Calderone

(In [27557]) Apply adbapi-arraysize-patch3.patch

refs #3956

comment:13 Changed 13 years ago by Jean-Paul Calderone

(In [27558]) Fix copyright dates

refs #3956

comment:14 Changed 13 years ago by Jean-Paul Calderone

Keywords: review added

Addressed those review points. I also had to change Transaction._cursor to Transaction.cursor because I wasn't happy with ConnectionPool going through a private attribute of the transaction object.

comment:15 Changed 13 years ago by amaury

I think that arraysize is not a good idea for a keyword argument to excute(); this function uses keyword arguments for named binds, and I do have a query where a parameter is named arraysize (It's a stored procedure that copies rows between tables, and uses an internal array).

I can see two solutions to ensure compatibility:

  • add some twisted-specific prefix to arraysize (like the cp_min parameter to ConnectionPool) so that collisions are less likely
  • create a new function runQueryWithArraysize(stmt, arraysize, *args, **kw)

comment:16 Changed 13 years ago by Jean-Paul Calderone

Thanks for the feedback, amaury. That's a good point. Since adbapi seems to have claimed the cp_ prefix already, using that seems like it should be safe enough. It would be nice if these APIs avoided the possibility of collisions entirely, but I think that's a task for another day (unless someone else chimes in and disagrees with the cp_-prefix solution).

comment:17 Changed 13 years ago by TimAllen

Owner: set to TimAllen

comment:18 Changed 13 years ago by TimAllen

Keywords: review removed
Owner: changed from TimAllen to Jean-Paul Calderone
  1. t/e/adbapi.py, in Transaction: All these methods should have docstrings.
  2. t/t/test_adbapi.py, line 669: The TransactionTestCase class has a test_cursor() method that has no code at all, just a docstring.
  3. About t.e.adbapi.Transaction.cursor: It seems that the only read/write attribute on a cursor object is arraysize, and it seems clunky to mess with Transaction's public interface just for that. How about leaving the instance variable as _cursor and adding getArraySize/setArraySize methods? (a property would be even better, but that requires a new-style class)

I'm kind of ambivalent about the whole approach of this patch, really: the only method that runQuery() ever calls on the cursor is fetchall(), which DBAPI-2.0 describes with "Note that the cursor's arraysize attribute can affect the performance of this operation." Presumably in sensible DBAPI modules, fetchall() will read chunks as large as possible rather than limiting itself to arraysize, but Gerrat appears to have found a module that needlessly limits itself, so some configuration is needed.

However, do we really need to set a separate arraysize for every query? Considering we always call fetchall(), presumably we'll always want to use whatever arraysize makes fetchall() fastest. I doubt there's a value that would work for every DBAPI module in every Twisted installation all over the world, but it seems sensible that every query in a given ConnectionPool would want to use the same arraysize. How about adding a cp_arraysize keyword parameter to ConnectionPool, and applying that setting in _runInteraction()? It's a pretty easy way to configure arraysize, it has no backwards compatibility problems, and it shouldn't be too hard for each Twisted user to find a value that improves things overall in their environment.

Of course, when you're pushing for performance there's always exceptions, and some users might need to set arraysize differently for different queries, and maybe use other calls than fetchall(). They can continue using runInteraction() as they presumably already do.

comment:19 Changed 13 years ago by TimAllen

Gerrat posted the following reply to the mailing-list:

Sorry I've dropped the ball on this whole request that I initiated. I'm kind of swamped at work this time of year (I should have more time starting mid Jan.) I have a minute to weigh in though.

As for the exact details on how this is implemented, I don't have any strong preferences. I like the suggestion to add the cp_arraysize parameter to ConnectionPool and apply it in _runInteraction()...wish I'd thought of it.

I also agree that it's unfortunate that the performance of fetchall() is impacted by the arraysize attribute; but I think that if it's a deficiency, the issue is with the DBAPI-2.0 specification, not with this implementation. (It would be kind of pointless mentioning that arraysize could affect the performance of this method, if this method ignored arraysize.)

Regards,

Gerrat

comment:20 Changed 13 years ago by Jean-Paul Calderone

(In [27707]) Actually write test_cursor

Refs #3956

comment:21 Changed 13 years ago by Jean-Paul Calderone

(In [27708]) docstrings for transaction's methods

refs #3956

comment:22 Changed 13 years ago by Jean-Paul Calderone

Keywords: review added
Owner: Jean-Paul Calderone deleted

To elaborate on what Gerrat wrote, here's a few quotes from the DB-API 2.0 PEP:

.fetchmany([size=cursor.arraysize])

Fetch the next set of rows of a query result, returning a sequence of sequences (e.g. a list of tuples). An empty sequence is returned when no more rows are available.

...

Note there are performance considerations involved with the size parameter. For optimal performance, it is usually best to use the arraysize attribute. If the size parameter is used, then it is best for it to retain the same value from one .fetchmany() call to the next.

...

.arraysize

This read/write attribute specifies the number of rows to fetch at a time with .fetchmany(). It defaults to 1 meaning to fetch a single row at a time.

I was very disappointed to read this stupidity, and then I wrote the code in the branch.

I'm also not very excited by the Transaction class, either the existing code or the new public cursor attribute I added to it. I don't see any good reason for this class to exist at all. Passing the cursor (or just the connection) to runInteraction methods would provide just as much utility with less code. However, given that the class is what it is, and is part of a public interface, I think making the cursor attribute public is the best decision. There may be other non-standard methods or attributes (eg last_row_id) which could be useful to applications.

I fixed points 1 and 2, though.

comment:23 Changed 13 years ago by TimAllen

I take it you don't agree with the 'cp_arraysize should be a parameter for ConnectionPool not runQuery' idea, then? I brought it up because the ReviewProcess document says "No backwards-incompatible changes", although I'm not sure how rigidly that is enforced.

Your fixes for points 1 and 2 and your argument for point 3 are fine, but I'll leave this ticket in the review state so that a more experienced reviewer can take a stance on the backwards-compatibility issue.

comment:24 Changed 13 years ago by Jean-Paul Calderone

Cc: TimAllen added

I think making it a parameter to runQuery makes sense, since this gives control over how results are fetched at a per-query level. I think the backwards compatibility issues are the same whether this parameter is added to __init__ or runQuery, but I might be missing something. Can you elaborate on that point? Ideally the compatibility policy would be enforced very rigidly.

comment:25 Changed 13 years ago by TimAllen

Cc: amaury added

The compatibility issue was raised by amaury in comment:15, but to summarise:

  • Previously, all keyword arguments to runQuery() were passed to cursor.execute(), for use with prepared statements or whatever.
  • With this change, all keyword arguments except cp_arraysize will be passed to cursor.execute().

DBAPI-2.0 declares the execute method as execute(query[, args]), expecting args to be a dict (if paramstyle is named or pyformat) or a tuple (for other paramstyles), but there are some DBAPI implementations that accept keyword arguments instead of a dict - cx_Oracle being one. Parameters-by-keyword-argument is also implicitly endorsed on the Python wiki.

On the other hand, none of psycopg2, pgdb, MySQLdb, pyodbc nor sqlite3 allow keyword arguments to cursor.execute(), so for most users I guess it would be completely safe.

I'd be interested to know which database system amaury was using that accepted keyword-arguments.

comment:26 Changed 13 years ago by Dave Peticolas

I use the ConnectionPool class with named binds (cx_Oracle). I have, in fact, sub-classed ConnectionPool so that it sets arraysize on every query (the same value each time) thus making the array size an argument to ConnectionPool rather than runQuery. That said, I think adding it to runQuery is fine.

comment:27 Changed 13 years ago by radix

Maybe instead of making 'cursor' public and requiring its existence, perhaps we could just pass the arraysize to Transaction's constructor? Either way, it's backwards incompatible, so I dunno.

comment:28 Changed 13 years ago by Jean-Paul Calderone

I decided to expose the cursor rather than add more code to Transaction because there doesn't really seem to be any good reason for Transaction to exist in the first place. Given that I'd rather see the cursor being passed along to application code directly, I don't see any good reason to complicate Transaction further.

The argument about cp_arraysize potentially conflicting with an application keyword is valid one. Since I don't think this is going to bother anyone in practice, I'm not extremely worried about it. However, if someone else is, then I don't mind if they change things around to avoid this (but I'm not sure I will bother). Adding an argument to Transaction seems backwards compatible, as long as it's optional.

comment:29 Changed 12 years ago by Screwtape

Keywords: review removed
Owner: set to Screwtape

However, if someone else is, then I don't mind if they change things around to avoid this (but I'm not sure I will bother).

Yeah, it bugs me. I'm going to work on it.

Changed 12 years ago by Screwtape

Moves the cp_arraysize argument to ConnectionPool.init

comment:30 Changed 12 years ago by Screwtape

Keywords: review added
Owner: Screwtape deleted

The attached patch (intended to be applied to the existing branch) does the following:

  1. Bumps copyright dates to 2010. Happy new year!
  2. Adds a new (unfortunately public) arraysize property to ConnectionPool where the default array size for new cursors is kept.
  3. Removes cp_arraysize handling from runQuery() and its associated methods.
  4. Adds code to _runInteraction() to set the arraysize on the Transaction object's cursor, if it has one (Transaction objects in older versions of Twisted didn't have one, substitute objects people might use probably won't have one either. Certainly the mock Transaction object used in the test suite didn't have one). See #2488 for reasons why somebody might substitute their own implementation of Transaction.
  5. Updates exarkun's test to use my new API.
  6. Adds a test to verify that the cursor passed to runInteraction() has its arraysize set, too. If arraysize is being set per-Pool, it's reasonable to expect that every cursor created by the Pool will inherit that arraysize.

Ready for review!

comment:31 Changed 12 years ago by washort

Owner: set to Jean-Paul Calderone

Looks pretty good. +1 for merging.

comment:32 Changed 12 years ago by washort

Keywords: review removed

comment:33 Changed 12 years ago by Jean-Paul Calderone

(In [28119]) Apply adbapi-arraysize-in-ConnectionPool.diff

refs #3956

comment:34 Changed 12 years ago by Jean-Paul Calderone

Author: Gerrat Rickert, exarkunGerrat Rickert, Screwtape, exarkun
Owner: changed from Jean-Paul Calderone to Screwtape

Notwithstanding washort's +1, I have a question or two about the latest changes. What's with the comment about Twisted 9.0.0 and the conditional on getattr? Also the tests should use assertEquals rather than failUnlessEqual.

Changed 12 years ago by Screwtape

New code should use positive assertions, not double-negative assertions.

comment:35 in reply to:  34 Changed 12 years ago by Screwtape

Keywords: review added
Owner: Screwtape deleted

Replying to exarkun:

What's with the comment about Twisted 9.0.0 and the conditional on getattr?

For a long time, ConnectionPool has had a transactionFactory property that it uses to build transaction objects. That property defaults to t.e.adbapi.Transaction but it's not required to be that way — it's not even required to be a subclass or to implement any particular interface. Just because the t.e.adbapi.Transaction now exposes a public .cursor property doesn't mean that every transaction object will have one. For example, test_runWithInteractionRaiseOriginalError in test_adbapi.py uses a DummyTransaction class that doesn't have a .cursor attribute.

I guess that getattr conditional should have an else: clause that issues a warning?

Also the tests should use assertEquals rather than failUnlessEqual.

Patch attached.

comment:36 Changed 12 years ago by Jean-Paul Calderone

Just because the t.e.adbapi.Transaction now exposes a public .cursor property doesn't mean that every transaction object will have one.

Argh.

comment:37 Changed 12 years ago by therve

Keywords: review removed
Owner: set to Jean-Paul Calderone

test_arraysizeAffectsRunInteraction should not use failUnlessEqual. Everything else looks good.

comment:38 Changed 12 years ago by Gerrat

I guess I should have reviewed the current implementation more closely, sooner, but I just noticed a small issue with it. The issue is that it explicitly sets the arraysize to 1 regardless of whether the caller has decided to override the default or not. I believe older versions of cx_oracle used the DBAPI mentioned default of 1 for the arraysize; but the more recent versions uses 50. ...so this implementation will actually make adbapi perform much worse than before (unless user explicitly uses cp_arraysize parameter).

I'm not sure what other dbapi modules use for their default, but whatever they use, I think adbapi should only override if explicitly told to do so.

comment:39 Changed 11 years ago by <automation>

Owner: Jean-Paul Calderone deleted
Note: See TracTickets for help on using tickets.