Opened 8 years ago

Last modified 8 years ago

#4124 enhancement new

Re-implement twisted.mail.imap4's parsers using real parser technology

Reported by: Jean-Paul Calderone Owned by:
Priority: high Milestone:
Component: mail Keywords:
Cc: khorn Branch:


twisted.mail.imap4 deals with the rather complicated task of parsing the IMAP4 protocol (in both directions - the protocol is not symmetrical) with a series of alternately naive, gross, broken, inefficient, poorly tested, and very nearly unmaintainable hacks, tricks, kludges, and plain stupidity (I can say this, I wrote most of it).

#4049 has at last convinced me that this parsing code must be rewritten. This will be difficult for a number of reasons:

  • The IMAP4 grammar is somewhat complicated. This is probably surmountable, but it would be good if someone with significant experience developing parsers could be involved in the process.
  • There is a lot of grammar to account for.
  • Quirks of the existing parser are exposed to the application-level APIs. The most reasonable resolution to this may be the introduction of new APIs and the deprecation of old ones. Of course, figuring out how to continue to support the old APIs with a new parser in place may be tricky.
  • The client and server presently approach the problem from different directions, sharing almost none of their parsing code (but of course not exactly none - that would be too easy).
  • The current parser is wrong in a number of places, yet all the existing tests are passing, so the existing test suite is clearly not sufficient to demonstrate that the new parser is correct.

I think it would be nice if this opportunity were taken to approach parser development differently than it is usually approached in Twisted. If possible, declaratively specifying much or all of the grammar and using this to automatically generate the parser would be preferable (and if we can do it here, I doubt any other protocols will give us much trouble). However, I won't say this is necessary outright, as it is more important to have a working parser than to have a parser with a neat implementation.

See these other open IMAP4-related tickets for some idea of the current problems:

Change History (9)

comment:1 Changed 8 years ago by khorn

Cc: khorn added

comment:2 Changed 8 years ago by ralphm

For the XPath-ish parsing in Words, yapps2 is used. Maybe that is a way forward?

comment:3 Changed 8 years ago by darkporter

I remember reading about how Zed Shaw used Ragel to write mongrel (ruby web server). I wonder if anything similar exists for python? It might be useful for IMAP.

comment:4 Changed 8 years ago by khorn

For no particular reason, I was thinking about this today, and I think a "plan" for doing this might look something like:

  1. Decide what should be the "public" API for IMAPServer/IMAPClient. Make everything else "private" (is there a convention for naming private methods in the Twisted coding standard? I didn't see one...) and deprecate the public interfaces of those attributes/methods.
  2. Isolate the parsing code. Its' kind of all over the place at the moment. A given IMAP line in IMAPServer for instance, goes though a bunch of functions and gets parsed a bit at a time in lots of different places. This by itself isn't sucha a bad thing, but it gets mixed up with a bunch of the other code (like the protocol code, etc.) and needs to be isolated better. Bits of code (functions or whatever) should either be parsing, or doing something else. Ideally you'd end up with something like (simplified!):
    def lineReceived(self, line):

        # pre-parsing stuff here

        func, args, kwargs = parse_line(line)
        func(*args, **kwargs)

        # post-parsing stuff here

or maybe even

    def lineReceived(self, line):

        # pre-parsing stuff here

        partial_func = parse_line(line)

        # post-parsing stuff here

where parse_line() does some partial function application. Not sure if that's really useful though.

At any rate, once the parsing code is more isolated, it will be much easier to experiment with various parsing strategies. I'm not really sure this is even feasible until at least some isolation work is done.

comment:5 Changed 8 years ago by Jean-Paul Calderone

As far as the implementation of parsing goes, that sounds fine. It doesn't address the really tricky part, though, which is the structure of the objects created from parsing server responses and passed back to application code. That's what I was trying to describe with the third point in the ticket description.

This is most evident in for the fetch methods. Server responses are parsed into complicated nested list/tuple/string structures which aren't really ideal to work with. These will probably have to be reproduced based on the results of the new parser to preserve backwards compatibility, but we'll want a new, better structure at some point too.

comment:6 Changed 8 years ago by khorn

AH, yes, I see...I had mostly been looking at the server parsing. Still, I don't think this is necessarily a _huge_ obstacle, though it doesn't look like it will be a lot of fun to straighten out. At least, it's not much huger that replacing the parsing code of the module :).

comment:7 Changed 8 years ago by khorn

An interesting experiment in this arena might be using pyMeta to reimplement a small part of the parsing here, for example parseNestedParens.

If I have time I might do this, but I thought I'd mention it here in case a) someone else thinks it's worth pursuing, and b) they get to it before I do.

comment:8 Changed 8 years ago by khorn

As a first step, refactor the code to separate literal-handling from command-parsing.

See #4561

comment:9 Changed 7 years ago by <automation>

Owner: Jean-Paul Calderone deleted
Note: See TracTickets for help on using tickets.