Opened 11 years ago

Last modified 8 years ago

#3201 enhancement new

thoroughly justify xish's existence, or decide to remove it

Reported by: Glyph
Owned by:
Priority: normal
Milestone:
Component: words
Keywords:
Cc: Glyph, jknight, ralphm, itamarst
Branch:
Author:

Description

On the one hand, XMPP fans (ralphm, dizzyd, and jack, I believe) suggest that xish is inherently superior for processing XMPP for some reason which I don't yet understand.

On the other hand, Twisted core folks (glyph, exarkun, and itamar, among others) believe that maintaining two mostly-incompatible XML libraries is bad. microdom, at least, has the excuse that it is an implementation of an external specification; xish is its own special API. Our collective suggestion / assumption is that eventually we should replace xish with lxml, which is emerging as the best XML processing library for Python, and has features required for XMPP (like xpath) built in.

I'm making this ticket to keep a record of the discussion. I will close it when we've all had a chance to hear both sides of the issue (and I'll make a decision to file another ticket describing the implementation of the solution). I'm assigning it to ralphm first so that we can get the story on what makes xish particularly good for XMPP and jabber, and if (and why) all of lxml is inappropriate, or only certain parts.

Change History (17)

comment:1 Changed 11 years ago by jknight

+1 to removing both xish and microdom as soon as possible. Maybe we can play a game: Twisted needs an XML library like it needs a .

However, unfortunately, I don't think lxml can do what xish does. Xish's existence was justified by it providing a streaming XML interface, both on the push side and the pull side. I don't believe you can do this today with lxml. I believe you can choose whether you want to push data in ("feed" interface) and get a single result, or iteratively pull data out ("iterparse") given a file-like-object, but not both. It'd be great to correct this defect (or just correct me if you can already do that!) in lxml, and then remove xish, though. :)
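For comparison, here is a minimal sketch of a combined push/pull interface of the kind described above. It uses `xml.etree.ElementTree.XMLPullParser` from the Python standard library, which was added in Python 3.4 and so postdates this comment; it says nothing about what lxml offered at the time. The stream content and the chunk boundaries are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Push data in with feed(); pull completed parse events out with read_events().
parser = ET.XMLPullParser(events=("start", "end"))

# Hypothetical stream: chunks arrive split at arbitrary points,
# and the root element is never closed.
chunks = [
    "<stream><message><bo",
    "dy>hello</body></message>",
    "<presence/>",
]

completed = []  # tags of first-level children, in order of completion
depth = 0
for chunk in chunks:
    parser.feed(chunk)
    for event, elem in parser.read_events():
        if event == "start":
            depth += 1
        else:
            depth -= 1
            if depth == 1:
                # A direct child of the root has just been closed.
                completed.append(elem.tag)

print(completed)  # ['message', 'presence']
```

Each first-level child becomes available as soon as its end tag has been fed, without waiting for the end of the document.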

comment:2 in reply to:  1 Changed 11 years ago by Glyph

Replying to jknight:

I don't think lxml can do what xish does. Xish's existence was justified by it providing a streaming XML interface, both on the push side and the pull side.

Definitely some part of xish needs to continue to exist to do that. However, xish also provides a custom DOM implementation (including, among other things, a magical __getattr__) which seems peripheral to that goal; it could easily produce elementtree objects. This would also remove the need for its DOM serializer and xpath implementation (and Yapps grammar file), which, in addition to its DOM implementation, comprises over half of its code.

comment:3 Changed 10 years ago by ralphm

Cc: Glyph jknight ralphm added

Here is my somewhat lengthy summary of XMPP's XML processing requirements and a survey of the different parts in Xish.

XML handling in XMPP differs from regular XML processing in a few ways, as defined in RFC 3920:

  • Stream oriented. Unlike regular XML processing, which is document oriented, XMPP traffic builds up one virtual XML document per direction. First a root element start tag is sent (the XML stream header), and from then on the unit of communication is the first-level child elements ('XML Stanzas') of that root element. When the connection is dropped without a closing tag for the root element having been sent, the closing tag is assumed to have been sent implicitly.
  • XMPP only uses (and allows) a strict subset of XML. No processing instructions, no DTDs, no comments. Also, all character data or attribute values that contain characters that map to the predefined entities MUST be escaped.
  • Namespace handling:
    • The XML stream header MUST have a namespace declaration that binds to the streams prefix. The root element (<stream/>) and the <features/> and <error/> child elements MUST be qualified by this prefix.
    • The XML stream header MUST have a default namespace declaration for qualifying the first-level children.
    • Implementations MUST NOT generate namespace prefixes for elements in the default namespace for a fixed set of namespaces, and SHOULD NOT for all others.
    • For 'dialback' there is a fixed prefix that, if used, MUST be declared on the XML stream header.

For parsing of incoming XML, a regular SAX-like parser can be used. The domish module uses Expat and falls back to twisted.web.sux. However, given the restrictions to XML handling given above, serialization of XML is more involved. In particular, XML libraries typically do not adhere to the namespace handling rules required by XMPP.
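As a rough illustration of the SAX-like approach, the sketch below drives Expat directly and hands each completed first-level child of the stream root to a callback. It is not the domish implementation: the `StanzaParser` class, its method names, and the choice of `xml.etree.ElementTree` elements as the output structure are all assumptions made for this example.

```python
import xml.parsers.expat
import xml.etree.ElementTree as ET


class StanzaParser:
    """Hypothetical helper: emit first-level children of the stream root."""

    def __init__(self, on_stanza):
        self._on_stanza = on_stanza
        self._stack = []  # currently open elements below the root
        self._depth = 0
        # With a namespace separator, Expat reports names as "uri local".
        self._parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
        self._parser.StartElementHandler = self._start
        self._parser.EndElementHandler = self._end
        self._parser.CharacterDataHandler = self._data

    @staticmethod
    def _qname(name):
        # Convert Expat's "uri local" form to ElementTree's "{uri}local".
        if " " in name:
            uri, local = name.split(" ", 1)
            return "{%s}%s" % (uri, local)
        return name

    def _start(self, name, attrs):
        self._depth += 1
        if self._depth == 1:
            return  # the stream header itself is not a stanza
        elem = ET.Element(self._qname(name),
                          {self._qname(k): v for k, v in attrs.items()})
        if self._stack:
            self._stack[-1].append(elem)
        self._stack.append(elem)

    def _end(self, name):
        self._depth -= 1
        if self._depth == 0:
            return  # the stream root was closed
        elem = self._stack.pop()
        if self._depth == 1:
            self._on_stanza(elem)  # a complete stanza

    def _data(self, data):
        if not self._stack:
            return  # whitespace between stanzas
        elem = self._stack[-1]
        if len(elem):
            elem[-1].tail = (elem[-1].tail or "") + data
        else:
            elem.text = (elem.text or "") + data

    def feed(self, data):
        self._parser.Parse(data, False)


stanzas = []
stream = StanzaParser(stanzas.append)
stream.feed("<stream:stream xmlns:stream='http://etherx.jabber.org/streams' "
            "xmlns='jabber:client'>")
stream.feed("<message to='juliet@example.com'><body>hi")
stream.feed("</body></message>")
print(stanzas[0].tag)  # {jabber:client}message
```

Note that this only covers the parsing half; the namespace-aware serialization discussed next is the harder part.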

To process XML Stanzas, it is desirable to work with DOM like structures, but unlike document oriented XML processing, partial DOM like structures. A common pattern is to respond to incoming XML Stanzas by creating a new partial DOM for the response, often using parts of the incoming Stanza, and send the serialization of that partial DOM over the wire.

Another pattern, mostly for XMPP servers, is to take in a Stanza from one stream (e.g. a client) and send it on to another stream (e.g. a server-side component or a server). Client, server and component streams all use different default namespaces, that are very similar. I.e. a Stanza from a client stream in the default client namespace is mostly identical when sent over a server-to-server connection, with the elements that were in the client namespace now qualified by the server namespace. This brings the following requirements:

  • Partial DOM structures, whether representing a complete or part of an XML Stanza, must be serializable independently of the stream they were received from, if applicable.
  • It should be possible to strip a partial DOM structure of the default namespace it is in (i.e. have it depend, when serialized, on the prevailing default namespace of wherever the snippet is inserted), or to have that default namespace changed recursively.
  • It should be possible to create elements that are in no default namespace. Note that this is different from an element being in the empty namespace. Again, the use case is being able to depend on a future containing parent or ancestor element to complete the qualification of the element.
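The recursive namespace change in the second requirement can be sketched with standard-library ElementTree elements, as below. The `rehome` function is a hypothetical name, not part of domish. Passing `None` as the new namespace strips the namespace entirely, which also shows the limit of this approach: ElementTree has only unqualified tags, so it cannot express the distinction between "no default namespace" and "the empty namespace" that the last requirement draws.

```python
import xml.etree.ElementTree as ET


def rehome(elem, old_uri, new_uri):
    """Move elem and its descendants from old_uri to new_uri, in place.

    A new_uri of None strips the namespace. Elements in any other
    namespace are left untouched.
    """
    old_prefix = "{%s}" % old_uri
    for node in elem.iter():
        tag = node.tag
        if isinstance(tag, str) and tag.startswith(old_prefix):
            local = tag[len(old_prefix):]
            node.tag = "{%s}%s" % (new_uri, local) if new_uri else local
    return elem


# A stanza as it might arrive on a client stream.
message = ET.Element("{jabber:client}message", {"to": "juliet@example.com"})
body = ET.SubElement(message, "{jabber:client}body")
body.text = "hi"
ext = ET.SubElement(message, "{urn:example:ext}x")  # made-up extension namespace

# Re-qualify it for a server-to-server stream.
rehome(message, "jabber:client", "jabber:server")
print([node.tag for node in message.iter()])
```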

Additionally, XML Stanzas can contain documents with horribly defined semantics that require preservation of the used namespace prefixes, because they are used in character data or attribute values, too. Yes, SOAP and friends. This brings the additional requirement to be able to store the used prefixes in the DOM structure, so that when the partial DOM containing such XML moves to another stream, it will serialize similarly to the original.

As it is pretty hard, and was virtually impossible when Xish was first created, to satisfy these requirements with existing XML libraries, Domish came to be. Providing the DOM with a different API, like ElementTree's, is likely to be possible, but I'm not sure if that API is rich enough for the requirements mentioned above.

Further, Xish contains EventDispatcher, which is the basis of XmlStream objects. This event dispatcher allows for setting up observers of incoming XML stanzas, based on XPath-like expressions. An XmlStream will hand incoming data over to a domish.elementStream for parsing. For every complete Stanza, it will call the dispatch method of EventDispatcher with the partial DOM of that Stanza. The dispatcher will match the DOM against the registered XPath-like expressions, and if there is a match, the assigned observers will be called with the DOM.
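The dispatch pattern can be sketched as follows. This is not the xish EventDispatcher API: `TinyDispatcher`, `add_observer` and `dispatch` are hypothetical names, and the matching relies on the limited path syntax of the standard library's ElementTree rather than on the xish XPath-like implementation.

```python
import xml.etree.ElementTree as ET


class TinyDispatcher:
    """Hypothetical, minimal observer registry keyed by path expressions."""

    def __init__(self):
        self._observers = []  # (path, callback) pairs

    def add_observer(self, path, callback):
        self._observers.append((path, callback))

    def dispatch(self, stanza):
        # ElementTree paths match children of a context node, so the
        # stanza is placed under a throwaway wrapper before matching.
        wrapper = ET.Element("wrapper")
        wrapper.append(stanza)
        matched = False
        for path, callback in list(self._observers):
            if wrapper.find(path) is not None:
                callback(stanza)
                matched = True
        return matched


seen = []
dispatcher = TinyDispatcher()
dispatcher.add_observer("message[@type='chat']",
                        lambda s: seen.append(("chat", s.get("from"))))
dispatcher.add_observer("iq", lambda s: seen.append(("iq", s.get("id"))))

chat = ET.Element("message", {"type": "chat", "from": "romeo@example.net"})
ping = ET.Element("iq", {"type": "get", "id": "p1"})
other = ET.Element("presence")

results = [dispatcher.dispatch(s) for s in (chat, ping, other)]
print(seen)
```

An unmatched stanza (here the presence element) simply returns False, leaving the caller to decide what to do with it.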

I suppose that the custom XPath-like implementation exists in order to work with the custom DOM structures defined in domish, and because typical XMPP processing doesn't really need a full-fledged XPath implementation. Jack has mentioned before that this XPath implementation is a lot faster than the full-fledged one(s?) he tried. For some uses, though, it would be very nice if there was a more complete XPath implementation here, for example to validate incoming Stanzas or extract data from them.

The XmlStream implementation in Xish is a generalized version of the Jabber specific one in twisted.words.protocols.jabber.xmlstream. I used it successfully with the now-defunct FeedMesh service and some private projects.

comment:4 Changed 10 years ago by Glyph

Wow. Thanks for the extremely detailed response, ralph!

comment:5 in reply to:  3 ; Changed 10 years ago by jack

  • Implementations MUST NOT generate namespace prefixes for elements in the default namespace for a fixed set of namespaces, and SHOULD NOT for all others.

This seems to contradict the tests in test_domish. If no default namespace is used, localPrefixes will cause a toplevel <bar:foo xmlns:bar='somens'/> instead of <foo xmlns='somens'/>.

A similar test (perhaps the same one?) wants the child of this to be unprefixed. But if there was no default namespace for the first stanza, why has it changed for the child?

comment:6 in reply to:  5 Changed 10 years ago by ralphm

Replying to jack:

  • Implementations MUST NOT generate namespace prefixes for elements in the default namespace for a fixed set of namespaces, and SHOULD NOT for all others.

This seems to contradict the tests in test_domish. If no default namespace is used, localPrefixes will cause a toplevel <bar:foo xmlns:bar='somens'/> instead of <foo xmlns='somens'/>.

I think there is no contradiction.

The quoted text comes from XMPP Core, RFC 3920, section 11.2.2. The term default namespace here applies to the default namespace of the stream element (e.g. jabber:client). Since we render logical children of the stream element independently of the stream element's start tag, domish has some provisions to work in this setting.

By default, elements will not be rendered with a namespace prefix, to comply with that section. First off, elements (whatever depth/distance from the stream element) SHOULD NOT use the prefix notation:

>>> Element((None, 'message')).toXml()
u'<message/>'
>>> Element(('urn:ietf:params:xml:ns:xmpp-tls', 'starttls')).toXml()
u"<starttls xmlns='urn:ietf:params:xml:ns:xmpp-tls'/>"

Second, child elements of the stream MUST NOT use the prefix notation if the default namespace of the stream is jabber:client or jabber:server. Well, this is satisfied by the above, but will get you a superfluous xmlns='jabber:client', because the serializer doesn't know the default namespace of the stream. So we tell that to the serializer:

>>> Element(('jabber:client', 'message')).toXml(defaultUri='jabber:client')
u'<message/>'

As far as I can see there is no test case for passing defaultUri, which is unfortunate. Addressing the ugliness is not a real requirement, though, and my code generates elements that eventually go out on client or server streams with None as the namespace, and converts incoming stanzas such that the elements in the stream's default namespace go to None, too. This significantly eases the transition of stanzas that come from clients (jabber:client) to other servers (jabber:server).

The localPrefixes argument to Element is for storing prefix declarations on this particular element. This is useful in situations where you want to define a prefix for child elements or attributes. An example is the dialback namespace:

>>> stream = Element(('http://etherx.jabber.org/streams', 'stream'),
...                  'jabber:server',
...                  localPrefixes={'db': 'jabber:server:dialback'})
>>> stream.toXml()
u"<xn0:stream xmlns:xn0='http://etherx.jabber.org/streams' xmlns='jabber:server' xmlns:db='jabber:server:dialback'/>"

Here you see that because the element's namespace is different from the default namespace, a prefix is needed and generated for us. To supply a prefix explicitly we can do:

>>> stream.toXml(prefixes={'http://etherx.jabber.org/streams': 'stream'})
u"<stream:stream xmlns:stream='http://etherx.jabber.org/streams' xmlns='jabber:server' xmlns:db='jabber:server:dialback'/>"

To now make the actual dialback exchange use prefix notation, we do something like this in the send method of XmlStream:

>>> result = Element(('jabber:server:dialback', 'result'))
>>> prefixes = {'jabber:server:dialback': 'db'}
>>> result.toXml(prefixes=prefixes, prefixesInScope=prefixes.values())
u'<db:result/>'

A similar test (perhaps the same one?) wants the child of this to be unprefixed. But if there was no default namespace for the first stanza, why has it changed for the child?

I think you are referring to testLocalPrefixesWithChild, and there <baz/> does not have a namespace or prefix because the default namespace of the parent element is the empty namespace ('').

comment:7 Changed 10 years ago by radix

Keywords: review added

I'm putting this ticket into review because there is (what seems to be) a thorough justification in the comments of this ticket. Someone should thoroughly review it.

comment:8 Changed 10 years ago by radix

Owner: ralphm deleted

comment:9 Changed 10 years ago by Glyph

Keywords: review removed
Owner: set to ralphm

OK. This makes a lot more sense to me now - or at least, the Twisted implementation does. Why the XMPP standardization effort would go to such deliberate and meticulous effort to destroy any possible value they got out of using XML, an existing "standard", by requiring all kinds of non-standard behavior, still escapes me.

The one thing I think is still questionable is the custom DOM implementation. Everything I've seen here seems like it could be handled by a custom serializer working from microdom, minidom, or ElementTree - or heck, eventually, all 3.

My suggestion for handling the namespace wonkiness would be to create a specific internal-to-twisted.words namespace, let's say http://twistedmatrix.com/ns/twisted.words/xmpp-no-namespace, that would be treated specially by the custom serializer, associating with the containing namespace.

In fact, with such a namespace and a little glue code, it seems to me that we wouldn't even need a custom serializer, just some code that was smart about working with Element objects representing DOM fragments, and walking over them to properly normalize namespaces when inserting one into another.
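A minimal sketch of what such glue code might look like, assuming standard-library ElementTree elements and the placeholder namespace URI proposed above. The function names `adopt`, `namespace_of` and `_resolve` are invented for this illustration, and the rule that a placeholder element takes the namespace of its nearest resolved ancestor is one possible reading of "associating with the containing namespace".

```python
import xml.etree.ElementTree as ET

# Placeholder namespace from the proposal above.
PLACEHOLDER = "http://twistedmatrix.com/ns/twisted.words/xmpp-no-namespace"


def namespace_of(elem):
    """Return the namespace URI of elem, or None if it is unqualified."""
    tag = elem.tag
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else None


def _resolve(elem, inherited):
    # Replace the placeholder with the namespace inherited from the parent,
    # then let the children inherit from the (now resolved) element.
    if namespace_of(elem) == PLACEHOLDER:
        local = elem.tag.split("}", 1)[1]
        elem.tag = "{%s}%s" % (inherited, local) if inherited else local
    own = namespace_of(elem)
    for sub in elem:
        _resolve(sub, own)


def adopt(parent, child):
    """Append child to parent, normalizing placeholder namespaces first."""
    _resolve(child, namespace_of(parent))
    parent.append(child)
    return child


# A fragment built without knowing which stream it will end up on.
body = ET.Element("{%s}body" % PLACEHOLDER)
span = ET.SubElement(body, "{%s}span" % PLACEHOLDER)
ext = ET.SubElement(body, "{urn:example:ext}x")  # made-up extension namespace
note = ET.SubElement(ext, "{%s}note" % PLACEHOLDER)

message = ET.Element("{jabber:server}message")
adopt(message, body)
print([node.tag for node in message.iter()])
```

Here `body` and `span` end up in jabber:server, while `note` follows its own container into the extension namespace.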

I therefore propose that we close this ticket and file another one for deleting the DOM implementation in xish.

Thoughts?

comment:10 Changed 8 years ago by Itamar Turner-Trauring

Cc: itamarst added

Not sure I'm qualified to say "for the love of god, yes please", but I can certainly try to get people to answer.

So, is glyph's proposal reasonable? Please don't force me to learn obscure details about XML namespaces and how XMPP abuses them.

comment:11 Changed 8 years ago by khorn

this might be relevant to the discussion: http://pyxmpp.jajcus.net/trac/ticket/38

comment:12 Changed 8 years ago by khorn

OK, so I've been over this ticket a number of times, and it's still not very clear to me what everyone's position is on this. Ralph provides a glut of information, which is a bit of a handful to digest, so please correct me if I've misunderstood anything.

First, I have to say that if this can be done in such a way that it reduces the size of the twisted.words codebase, it probably should be. lxml seems to me to be a good bet for at least the core of a replacement. It looks (at least to my untrained eye) that it can handle stream-style parsing fine, and I think that it shouldn't be impossible to create a serialization interface that would work (maybe using a custom lxml/etree Element class would help here).

I'm less sure about the namespacing issues, but it seems likely that something could be worked out.

However, replacing everything would probably be a pretty massive undertaking, so Glyph's proposal to just replace/delete the DOM stuff is pretty reasonable. I do think we should keep it based on the etree Element class though, and it would be nice if it could be compatible with lxml's Element class as well.

I also think that we should consider removing more of this stuff down the road, if possible.

Well, you asked for thoughts... :)

comment:13 in reply to:  12 ; Changed 8 years ago by Glyph

Replying to khorn:

OK, so I've been over this ticket a number of times, and it's still not very clear to me what everyone's position is on this. Ralph provides a glut of information, which is a bit of a handful to digest, so please correct me if I've misunderstood anything.

My position is that I'd like xish to go away. I think it creates a maintenance burden in an area that we don't have many (well, right now, it seems like "any") maintenance cycles. The only reason to keep it is if, well, we have to keep it. I am increasingly convinced that it is not, in fact, indispensable.

Sadly, removing it is, itself, maintenance, especially given that we will need some kind of compatibility or transition layer to move to the new code.

First, I have to say that if this can be done in such a way that it reduces the size of the twisted.words codebase, it probably should be. lxml seems to me to be a good bet for at least the core of a replacement. It looks (at least to my untrained eye) that it can handle stream-style parsing fine, and I think that it shouldn't be impossible to create a serialization interface that would work (maybe using a custom lxml/etree Element class would help here).

OK, so let me play devil's advocate for a moment...

Python 2.4 doesn't include 'xml.etree' on its own, so it seems like no matter what we're probably going to be adding a dependency.

If we add lxml, this is going to create build and install problems for people who probably don't have build and install problems yet; see <http://codespeak.net/lxml/build.html#building-lxml-on-macos-x> and <http://codespeak.net/lxml/build.html#static-linking-on-windows>, in particular.

I also don't love that we would be taking bytes off the wire and shoving them into a giant, opaque mass of C code: <http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=libxml2>. It would be nice to have a pure-python option for those who care more about security than performance. To be fair, if we are just talking about the DOM at first, that's at least one layer removed from the network.

If we go with ElementTree, there are apparently significant behavior differences between cElementTree and ElementTree (and, for that matter, xml.etree and lxml.etree). I don't know a lot about those, and I'd love it if someone could just tell me that I'm wrong about that and we can depend on a standard API from ElementTree and switch out parser implementations.

I don't think that any of these things should stop us, but they should be taken into account.

I'm less sure about the namespacing issues, but it seems likely that something could be worked out.

If you're just talking about replacing the DOM, then the namespacing issues go away: they're a serialization problem.

However, replacing everything would probably be a pretty massive undertaking, so Glyph's proposal to just replace/delete the DOM stuff is pretty reasonable. I do think we should keep it based on the etree Element class though, and it would be nice if it could be compatible with lxml's Element class as well.

When you say "keep it" based on the etree class, what are you referring to? It's currently its own custom thing, since it doesn't depend on etree.

I also think that we should consider removing more of this stuff down the road, if possible.

+1.

Well, you asked for thoughts... :)

Nothing is really going to happen on this ticket until someone steps forward to do the work. Are you volunteering? Please say yes :).

comment:14 in reply to:  13 Changed 8 years ago by khorn

Replying to glyph:

Replying to khorn:

OK, so I've been over this ticket a number of times, and it's still not very clear to me what everyone's position is on this. Ralph provides a glut of information, which is a bit of a handful to digest, so please correct me if I've misunderstood anything.

My position is that I'd like xish to go away. I think it creates a maintenance burden in an area that we don't have many (well, right now, it seems like "any") maintenance cycles. The only reason to keep it is if, well, we have to keep it. I am increasingly convinced that it is not, in fact, indispensable.

Once again, we're on the same page.

Sadly, removing it is, itself, maintenance, especially given that we will need some kind of compatibility or transition layer to move to the new code.

First, I have to say that if this can be done in such a way that it reduces the size of the twisted.words codebase, it probably should be. lxml seems to me to be a good bet for at least the core of a replacement. It looks (at least to my untrained eye) that it can handle stream-style parsing fine, and I think that it shouldn't be impossible to create a serialization interface that would work (maybe using a custom lxml/etree Element class would help here).

OK, so let me play devil's advocate for a moment...

Python 2.4 doesn't include 'xml.etree' on its own, so it seems like no matter what we're probably going to be adding a dependency.

If we add lxml, this is going to create build and install problems for people who probably don't have build and install problems yet; see <http://codespeak.net/lxml/build.html#building-lxml-on-macos-x> and <http://codespeak.net/lxml/build.html#static-linking-on-windows>, in particular.

Well, I haven't had any trouble installing it on Win32, until I tried it just now, of course...sigh. (Ah, no Py2.6 binary release for the latest lxml...you have to specify the version manually) At any rate, this is a good point, but the situation on Windows isn't that bad. Even if you want to build from source, the lxml people provide all of the dependencies. It's certainly easier to find a 'binary' for lxml than it was for say PyCrypto up until a few months ago. (Thanks, Michael Foord!) Getting OpenSSL installed is also a much bigger hassle than lxml on Win32, in my opinion.

The Mac people will have to sound off for themselves, but sure this is an issue. I can't stand installing Python or Python C Extensions on Mac OS...it's a nightmare.

Ideally, I'd like to see an etree-based solution that could use lxml if available, otherwise fall back to ElementTree.
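The usual shape of such a fallback is sketched below, on the assumption that the calling code restricts itself to the API the two implementations share. `HAVE_LXML` and `parse_fragment` are names made up for this example.

```python
try:
    from lxml import etree  # optional, C-backed implementation
    HAVE_LXML = True
except ImportError:
    import xml.etree.ElementTree as etree  # standard-library fallback
    HAVE_LXML = False


def parse_fragment(data):
    """Parse a complete XML fragment using only calls both backends provide."""
    return etree.fromstring(data)


stanza = parse_fragment(
    b"<message to='juliet@example.com'><body>hi</body></message>")
print(HAVE_LXML, stanza.tag, stanza.find("body").text)
```

Anything beyond the shared core, such as full XPath, would still need a backend-specific code path.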

I also don't love that we would be taking bytes off the wire and shoving them into a giant, opaque mass of C code: <http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=libxml2>. It would be nice to have a pure-python option for those who care more about security than performance. To be fair, if we are just talking about the DOM at first, that's at least one layer removed from the network.

It's a fair cop, but society is to blame.

And I think, yes, let's do DOM first, then worry about other stuff later. Though where the "DOM stuff" ends and the "network stuff" begins may not be that clear. I'll have to dig into the code a bit deeper.

If we go with ElementTree, there are apparently significant behavior differences between cElementTree and ElementTree (and, for that matter, xml.etree and lxml.etree). I don't know a lot about those, and I'd love it if someone could just tell me that I'm wrong about that and we can depend on a standard API from ElementTree and switch out parser implementations.

lxml's etree implementation is advertised as being a superset of ElementTree's etree implementation. So far, in my experience, this is true, and the lxml documentation seems to do a relatively good job of pointing out which features they have added.

It looks like both ElementTree and lxml have a somewhat similar target/feed parser interface...so I think that area is covered. A little glue might be required, but not much, I think.

I don't think that any of these things should stop us, but they should be taken into account.

I'm less sure about the namespacing issues, but it seems likely that something could be worked out.

If you're just talking about replacing the DOM, then the namespacing issues go away: they're a serialization problem.

Yeah, well, that's kind of what I thought, but I wasn't entirely sure.

However, replacing everything would probably be a pretty massive undertaking, so Glyph's proposal to just replace/delete the DOM stuff is pretty reasonable. I do think we should keep it based on the etree Element class though, and it would be nice if it could be compatible with lxml's Element class as well.

When you say "keep it" based on the etree class, what are you referring to? It's currently its own custom thing, since it doesn't depend on etree.

"keep it based on" = "restrict our efforts to a solution based on"

I also think that we should consider removing more of this stuff down the road, if possible.

+1.

Well, you asked for thoughts... :)

Nothing is really going to happen on this ticket until someone steps forward to do the work. Are you volunteering? Please say yes :).

Er...well, it's a topic I'm interested in and I'd like to do some work on it, but considering I'm having a rough time finding time to work on the tickets already assigned to me....

maybe...after a bit...if no one else does it first

comment:15 Changed 8 years ago by jknight

Aiming for portability across etree implementations would have some issues. For example, only lxml has a real XPath implementation. There is certainly a portable core API, but, it might not be featureful enough to implement what is needed. For example, I believe ElementTree has no way to preserve the namespace prefix when parsing, and upon serialization, always serializes with prefixes "ns0", "ns1", etc.
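The prefix behaviour can be reproduced with the standard library's `xml.etree.ElementTree` in a current Python 3, as in the sketch below; the input fragment is made up for illustration. `register_namespace` is a process-wide registry rather than per-document prefix preservation, so it only partly addresses the point.

```python
import xml.etree.ElementTree as ET

source = "<db:result xmlns:db='jabber:server:dialback'/>"
elem = ET.fromstring(source)

# The parsed element keeps the namespace URI but not the original prefix.
print(elem.tag)  # {jabber:server:dialback}result

# Serializing invents a prefix unless one has been registered.
generated = ET.tostring(elem, encoding="unicode")
print(generated)  # <ns0:result xmlns:ns0="jabber:server:dialback" />

# A global registration changes the prefix used for that URI everywhere.
ET.register_namespace("db", "jabber:server:dialback")
registered = ET.tostring(elem, encoding="unicode")
print(registered)  # <db:result xmlns:db="jabber:server:dialback" />
```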

If I were implementing this change, I'd certainly want to just assume lxml. And maybe if someone really wants to, they could make an including-more-lxml-features wrapper around ElementTree later...

The URLs you mention actually make me come to the opposite conclusion from yours: lxml has good documentation on how to build even on crappy platforms that try to make things hard for you, so there should be no issues for users. :) It'd be ideal if the lxml pypi had a precompiled version available for OSX, of course.

comment:16 in reply to:  15 Changed 8 years ago by Glyph

Replying to jknight:

Aiming for portability across etree implementations would have some issues. For example, only lxml has a real XPath implementation. There is certainly a portable core API, but, it might not be featureful enough to implement what is needed. For example, I believe ElementTree has no way to preserve the namespace prefix when parsing, and upon serialization, always serializes with prefixes "ns0", "ns1", etc.

It's because of issues like these that I wanted to separate "switching DOM API" and "switching parser" into separate steps. Since I believe namespace-preservation is a non-standard feature, we will certainly need parser-specific hacks.

If I were implementing this change, I'd certainly want to just assume lxml. And maybe if someone really wants to, they could make an including-more-lxml-features wrapper around ElementTree later...

Easier said than done :).

The URLs you mention actually make me come to the opposite conclusion from yours: lxml has good documentation on how to build even on crappy platforms that try to make things hard for you, so there should be no issues for users. :) It'd be ideal if the lxml pypi had a precompiled version available for OSX, of course.

Well, the URLs strongly imply "if you do the default things, you will encounter errors; here are the flags which mean 'actually work'". And some of them, like 'STATIC_DEPS=true easy_install ...', won't work for Twisted directly, because that will possibly build other C library dependencies (like PyOpenSSL) with static dependencies, which might not be what you want. So we'll need to copy and paste stuff from those URLs and refer to them, and still deal with people showing up when a vanilla 'setup.py' or 'easy_install' fails.

Still, I guess I generally agree with you that the situation isn't that bad.

comment:17 Changed 8 years ago by <automation>

Owner: ralphm deleted