[Twisted-Python] Re: IMAP fixes

Tony Meyer ta-meyer at ihug.co.nz
Thu Jul 10 02:59:42 EDT 2003


> Why?

Well, efficiency, for one.  timeit rates the single re 5-6 times faster than
the proof of concept that you posted (corrected to work).  That's a
considerable speed difference.
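Something along these lines reproduces the comparison (the multi-re parser
here is my own reconstruction of the approach, not the exact posted code):

    import re
    import timeit

    # The single re (the corrected form discussed below).
    single = re.compile(r'BODY\.?(PEEK)?\[(([\d\.]*)\.)?([A-Z\.]+)'
                        r'( \(([^\(\)]*)\))?\](\<\d+\.[1-9]\d*\>)?')

    # A multi-re version: tokenize piece by piece, advancing through
    # the string by hand.
    bodyRe = re.compile(r'BODY\.?(PEEK)?\[')
    sectionRe = re.compile(r'([\d\.]+)\.')
    partRe = re.compile(r'[A-Z\.]+')
    cketRe = re.compile(r'\]')

    def parse_multi(s):
        m = bodyRe.match(s)
        if m is None:
            return None
        pos = m.end()
        m = sectionRe.match(s, pos)
        if m is not None:
            pos = m.end()
        m = partRe.match(s, pos)
        if m is None:
            return None
        return cketRe.match(s, m.end())

    s = 'BODY.PEEK[1.2.HEADER]'
    print(timeit.timeit(lambda: single.match(s), number=100000))
    print(timeit.timeit(lambda: parse_multi(s), number=100000))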

> It's much better to split into several REs. When you need to 
> comment your REs, they're obviously useless

It didn't _need_ to be commented; I did so to point out how simple it
actually was.  This is a bizarre statement.  Does it then follow that if you
need to comment code, it's obviously useless?  Why would the re.VERBOSE flag
even exist if that were the case?

This re really isn't that complex.  If I take out the non-capturing groups
and the group names (which are there only to make the code easier to read),
and drop the alternation, as in your proof of concept, it becomes:
BODY\.?(PEEK)?\[(([\d\.]*)\.)?([A-Z\.]+)( \(([^\(\)]*)\))?\](\<\d+\.[1-9]\d*\>)?
If this were written out in verbose form, it would be at least as readable
as the proof of concept.
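Something like this, say (my layout and name; note the space before the
header list has to be escaped, since re.VERBOSE ignores unescaped
whitespace):

    import re

    fetchPartRe = re.compile(r"""
        BODY\.?               # the BODY attribute
        (PEEK)?               # optional PEEK modifier
        \[                    # literal [
        (([\d\.]*)\.)?        # optional dotted part numbers, e.g. "1.2."
        ([A-Z\.]+)            # section name, e.g. HEADER.FIELDS
        (\ \(([^\(\)]*)\))?   # optional parenthesised header list
        \]                    # literal ]
        (\<\d+\.[1-9]\d*\>)?  # optional partial range, e.g. <0.20>
        """, re.VERBOSE)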

> -- it is impossible to keep comments in sync with code,

Why?  Change the code, check the comment.  Simple.  How is it any different
from checking that a change you make doesn't affect other parts of the code?

> Not to mention you can't step with PDB through REs, or insert 
> prints, to see what exactly is going wrong.

But there are other tools for testing re's.  (In any case, you can get the
effect of print statements by slicing the pattern appropriately.)
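A rough sketch of what I mean: keep the pattern as a list of pieces and
match progressively longer prefixes of it; the prints show how far each
prefix gets, and the first None pinpoints the piece where the match dies:

    import re

    pieces = [
        r'BODY\.?', r'(PEEK)?', r'\[', r'(([\d\.]*)\.)?',
        r'([A-Z\.]+)', r'( \(([^\(\)]*)\))?', r'\]',
        r'(\<\d+\.[1-9]\d*\>)?',
    ]

    def diagnose(s):
        for i in range(1, len(pieces) + 1):
            m = re.compile(''.join(pieces[:i])).match(s)
            print(i, m and repr(s[:m.end()]))

    diagnose('BODY.PEEK[1.2.header]')  # lower-case section dies at piece 5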

> Here's a proof of concept (untested) factorization into several REs
> each of which is easy enough to describe:
> 
> sectionRe = re.compile('([\d\.])*\.') # digits and dots, ends with dot
[...]
> cket = re.compile('\]') # literal ]

This section is no easier to understand; the only difference is where the
pattern is split and how the comments are worded.

> # And when you write it like that, it is easy to find
> # false positives. For example, should we check s is empty?

You can check that the single re consumed the whole string just as easily;
I can't think of any test that can't be carried out with either solution.
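For example (my sketch; appending \Z to the pattern would do the same job
inside the re itself):

    import re

    single = re.compile(r'BODY\.?(PEEK)?\[(([\d\.]*)\.)?([A-Z\.]+)'
                        r'( \(([^\(\)]*)\))?\](\<\d+\.[1-9]\d*\>)?')

    def parse(s):
        m = single.match(s)
        if m is None or m.end() != len(s):
            # Catches the empty string and trailing junk alike.
            raise ValueError('malformed BODY attribute: %r' % (s,))
        return m.groups()

    print(parse('BODY[TEXT]'))      # fine
    parse('BODY[TEXT] trailing')    # raises ValueError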

> # Note how the parsing is now done with code -- only the low-level
> # tokenizing is done with REs. The code is relatively easy to read.

The code has become relatively more difficult to read, because with the
single re almost all the work is done inside the re engine and there is
hardly any code at all.  What the multiple-re version makes theoretically
easier to read is the parsing logic itself, not the code.

In addition, the multiple-re approach means additional code that must be
written, tested and maintained.  The single re has the advantage that the
code doing the parsing is contained within the re module, and so is someone
else's problem.  There is only a single re to be sure of, instead of several
lines of code.

For the most part, this is a stylistic choice, not a correct/incorrect one.
The single re has the advantage of speed (important, in this particular
case), and less code to write/maintain.  The multiple re approach makes the
parsing more explicit, and makes changing the re less error-prone (an
unlikely event, in this particular case).

As I indicated, I don't really care whether Jp prefers a single re, multiple
re's like this, or the string module and more if's; it's up to him
(whether I can be bothered coding the alternatives to offer as a patch is
another question ;).  I still personally believe that the single re is most
appropriate for this case, and given that I can always subclass the version
that gets checked in, it won't make any difference.

=Tony Meyer




