Ticket #3201 (new enhancement)

Opened 1 year ago

Last modified 4 months ago

thoroughly justify xish's existence, or decide to remove it

Reported by: glyph
Assigned to: ralphm
Type: enhancement
Priority: normal
Milestone:
Component: words
Keywords:
Cc: glyph, jknight, ralphm
Branch:
Author:
Launchpad Bug:

Description

On the one hand, XMPP fans (ralphm, dizzyd, and jack, I believe) suggest that xish is inherently superior for processing XMPP for some reason which I don't yet understand.

On the other hand, Twisted core folks (glyph, exarkun, and itamar, among others) believe that maintaining two mostly-incompatible XML libraries is bad. microdom, at least, has the excuse that it is an implementation of an external specification; xish is its own special API. Our collective suggestion / assumption is that eventually we should replace xish with lxml, which is emerging as the best XML processing library for Python, and has features required for XMPP (like xpath) built in.

I'm making this ticket to keep a record of the discussion. I will close it when we've all had a chance to hear both sides of the issue (and I'll make a decision to file another ticket describing the implementation of the solution). I'm assigning it to ralphm first so that we can get the story on what makes xish particularly good for XMPP and jabber, and if (and why) all of lxml is inappropriate, or only certain parts.

Attachments

Change History

follow-up: ↓ 2   2008-04-22 17:17:49+00:00 changed by jknight

+1 to removing both xish and microdom as soon as possible. Maybe we can play a game: Twisted needs an XML library like it needs a .

However, unfortunately, I don't think lxml can do what xish does. Xish's existence was justified by it providing a streaming XML interface, both on the push side and the pull side. I don't believe you can do this today with lxml. I believe you can choose whether you want to push data in ("feed" interface) and get a single result, or iteratively pull data out ("iterparse") given a file-like-object, but not both. It'd be great to correct this defect (or just correct me if you can already do that!) in lxml, and then remove xish, though. :)
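As a point of comparison, here is a minimal sketch of a combined push/pull interface. It uses the standard library's xml.etree.ElementTree.XMLPullParser (Python 3.4 and later), not lxml and not xish, so it says nothing about what lxml offered at the time. The StanzaReader class and its depth counter are hypothetical names invented for this illustration, and the XMPP namespace rules discussed elsewhere in this ticket are not addressed.

```python
import xml.etree.ElementTree as ET


class StanzaReader:
    """Hypothetical helper: push bytes or text in, pull complete stanzas out."""

    def __init__(self):
        # Ask for both start and end events so element depth can be tracked.
        self._parser = ET.XMLPullParser(events=("start", "end"))
        self._depth = 0

    def feed(self, data):
        """Push a chunk of the stream; return first-level children completed so far."""
        self._parser.feed(data)
        stanzas = []
        for event, elem in self._parser.read_events():
            if event == "start":
                self._depth += 1
            else:
                self._depth -= 1
                # Depth 1 after an end event means a direct child of the
                # stream's root element has just been closed.
                if self._depth == 1:
                    stanzas.append(elem)
        return stanzas


reader = StanzaReader()
reader.feed("<stream:stream xmlns:stream='http://etherx.jabber.org/streams' "
            "xmlns='jabber:client'>")
for stanza in reader.feed("<message><body>hi</body></message>"):
    print(stanza.tag)
```

The root element is never closed here, which mirrors a stream that is still open; the parser simply keeps waiting for more data.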

in reply to: ↑ 1   2008-04-22 18:12:57+00:00 changed by glyph

Replying to jknight:

I don't think lxml can do what xish does. Xish's existence was justified by it providing a streaming XML interface, both on the push side and the pull side.

Definitely some part of xish needs to continue to exist to do that. However, xish also provides a custom DOM implementation (including, among other things, a magical __getattr__) which seems peripheral to that goal; it could easily produce elementtree objects. This would also remove the need for its DOM serializer and xpath implementation (and Yapps grammar file), which, in addition to its DOM implementation, comprises over half of its code.

follow-up: ↓ 5   2008-04-23 07:59:13+00:00 changed by ralphm

  • cc set to glyph, jknight, ralphm

Here is my somewhat lengthy summary of XMPP's XML processing requirements and a survey of the different parts of Xish.

XML handling in XMPP differs from regular XML processing in a few ways, as defined in RFC 3920:

  • Stream oriented. Unlike regular XML processing, which is document oriented, XMPP traffic builds up one virtual XML document per direction. First a root element start tag is sent (the XML stream header), and then the unit of communication is first-level child elements ('XML Stanzas') of that root element. When the connection is dropped, if no closing tag for the root element was sent, it is assumed to have been sent implicitly.
  • XMPP only uses (and allows) a strict subset of XML. No processing instructions, no DTDs, no comments. Also, all character data or attribute values that contain characters that map to the predefined entities MUST be escaped.
  • Namespace handling:
    • The XML stream header MUST have a namespace declaration that binds to the streams prefix. The root element (<stream/>) and the <features/> and <error/> child elements MUST be qualified by this prefix.
    • The XML stream header MUST have a default namespace declaration for qualifying the first-level children.
    • Implementations MUST NOT generate namespace prefixes for elements in the default namespace for a fixed set of namespaces, and SHOULD NOT for all others.
    • For 'dialback' there is a fixed prefix that, if used, MUST be declared on the XML stream header.

For parsing of incoming XML, a regular SAX-like parser can be used. The domish module uses Expat and falls back to twisted.web.sux. However, given the restrictions to XML handling given above, serialization of XML is more involved. In particular, XML libraries typically do not adhere to the namespace handling rules required by XMPP.
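To illustrate the serialization problem, here is a small sketch using the standard library's ElementTree as a stand-in for a generic XML library (it is not domish, and other libraries may behave differently). A generic serializer is free to invent a prefix for a namespaced element, which is valid XML but is exactly what the namespace rules above rule out for elements in the stream's default namespace.

```python
import xml.etree.ElementTree as ET

# A stanza whose elements live in the client stream's default namespace.
message = ET.Element("{jabber:client}message")
ET.SubElement(message, "{jabber:client}body").text = "hi"

# ElementTree knows nothing about the enclosing stream header, so it binds
# the namespace to a generated prefix instead of relying on a default
# namespace declaration.
generic = ET.tostring(message, encoding="unicode")
print(generic)
```

The printed markup is namespace-equivalent to an unprefixed message element, but it carries a generated prefix that a strict XMPP peer is entitled to reject.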

To process XML Stanzas, it is desirable to work with DOM-like structures, but, unlike in document-oriented XML processing, partial ones. A common pattern is to respond to incoming XML Stanzas by creating a new partial DOM for the response, often using parts of the incoming Stanza, and to send the serialization of that partial DOM over the wire.

Another pattern, mostly for XMPP servers, is to take in a Stanza from one stream (e.g. a client) and send it on to another stream (e.g. a server-side component or a server). Client, server and component streams all use different default namespaces that are otherwise very similar: a Stanza from a client stream in the default client namespace is mostly identical when sent over a server-to-server connection, with the elements that were in the client namespace now qualified by the server namespace. This brings the following requirements:

  • Partial DOM structures, whether representing a complete or part of an XML Stanza, must be serializable independently of the stream they were received from, if applicable.
  • It should be possible to strip a partial DOM structure of the default namespace it is in (so that, when serialized, it depends on the default namespace in effect wherever the snippet is inserted), or to have that default namespace changed recursively.
  • It should be possible to create elements that are in no default namespace. Note that this is different from an element being in the empty namespace. Again, the use case is being able to depend on a future containing parent or ancestor element to complete the qualification of the element.

Additionally, XML Stanzas can contain documents with horribly defined semantics that require preservation of the used namespace prefixes, because they are used in character data or attribute values, too. Yes, SOAP and friends. This brings the additional requirement to be able to store the used prefixes in the DOM structure, so that when the partial DOM containing such XML moves to another stream, it will serialize similarly to the original.

Because it is pretty hard (and was virtually impossible when Xish was first created) to satisfy these requirements with existing XML libraries, Domish came to be. Providing the DOM with a different API, like ElementTree's, is likely to be possible, but I'm not sure if ElementTree is rich enough for the requirements mentioned above.

Further, Xish contains EventDispatcher, which is the basis of XmlStream objects. This event dispatcher allows for setting up observers of incoming XML stanzas, based on XPath-like expressions. An XmlStream will hand incoming data over to a domish.elementStream for parsing. For every complete Stanza, it will call the dispatch method of EventDispatcher with the partial DOM of that Stanza. The dispatcher will match the DOM against the registered XPath-like expressions, and if there is a match, the assigned observers will be called with the DOM.
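The observer pattern described here can be sketched roughly as follows. This is a toy written against the standard library's ElementTree, not xish's EventDispatcher: the TinyDispatcher class, its add_observer method, and matching on a plain (namespace, name) pair instead of an XPath-like expression are all simplifying assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


class TinyDispatcher:
    """Toy dispatcher: observers are keyed on (namespace URI, local name)."""

    def __init__(self):
        self._observers = []

    def add_observer(self, uri, name, callback):
        # ElementTree stores qualified names as "{uri}local".
        self._observers.append(("{%s}%s" % (uri, name), callback))

    def dispatch(self, stanza):
        """Call every observer whose key matches; report whether any did."""
        matched = False
        for tag, callback in self._observers:
            if stanza.tag == tag:
                callback(stanza)
                matched = True
        return matched


seen = []
dispatcher = TinyDispatcher()
dispatcher.add_observer("jabber:client", "message", seen.append)

dispatcher.dispatch(ET.Element("{jabber:client}message"))   # observer fires
dispatcher.dispatch(ET.Element("{jabber:client}presence"))  # no observer
```

A real dispatcher would match on more than the element's qualified name (attributes and child elements, for instance), which is where an XPath-like expression language earns its keep.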

I suppose that the existence of a custom XPath-like implementation is to be able to work with the custom DOM structures defined in domish, and because typical XMPP processing doesn't really need a full-fledged XPath implementation. Jack has mentioned before that this XPath implementation is a lot faster than the full-fledged one(s?) he tried. For some uses, though, it would be very nice if there was a more complete XPath implementation here, for example to validate incoming Stanzas or extract data from them.

The XmlStream implementation in Xish is a generalized version of the Jabber specific one in twisted.words.protocols.jabber.xmlstream. I used it successfully with the now-defunct FeedMesh service and some private projects.

  2008-04-23 13:52:56+00:00 changed by glyph

Wow. Thanks for the extremely detailed response, ralph!

in reply to: ↑ 3 ; follow-up: ↓ 6   2009-01-28 14:10:53+00:00 changed by jack

  • launchpad_bug deleted

* Implementations MUST NOT generate namespace prefixes for elements in the default namespace for a fixed set of namespaces, and SHOULD NOT for all others.

This seems to contradict the tests in test_domish. If no default namespace is used, localPrefixes will cause a toplevel <bar:foo xmlns:bar='somens'/> instead of <foo xmlns='somens'/>.

A similar test (perhaps the same one?) wants the child of this to be unprefixed. But if there was no default namespace for the first stanza, why has it changed for the child?

in reply to: ↑ 5   2009-01-28 16:26:11+00:00 changed by ralphm

Replying to jack:

* Implementations MUST NOT generate namespace prefixes for elements in the default namespace for a fixed set of namespaces, and SHOULD NOT for all others.

This seems to contradict the tests in test_domish. If no default namespace is used, localPrefixes will cause a toplevel <bar:foo xmlns:bar='somens'/> instead of <foo xmlns='somens'/>.

I think there is no contradiction.

The quoted text comes from XMPP Core, RFC 3920, section 11.2.2. The term default namespace here applies to the default namespace of the stream element (e.g. jabber:client). Since we render logical children of the stream element independently of the stream element's start tag, domish has some provisions to work in this setting.

By default, elements will not be rendered with a namespace prefix, to comply with that section. First off, elements (whatever depth/distance from the stream element) SHOULD NOT use the prefix notation:

>>> from twisted.words.xish.domish import Element
>>> Element((None, 'message')).toXml()
u'<message/>'
>>> Element(('urn:ietf:params:xml:ns:xmpp-tls', 'starttls')).toXml()
u"<starttls xmlns='urn:ietf:params:xml:ns:xmpp-tls'/>"

Second, child elements of the stream MUST NOT use the prefix notation if the default namespace of the stream is jabber:client or jabber:server. Well, this is satisfied by the above, but will get you a superfluous xmlns='jabber:client', because the serializer doesn't know the default namespace of the stream. So we tell that to the serializer:

>>> Element(('jabber:client', 'message')).toXml(defaultUri='jabber:client')
u'<message/>'

As far as I can see, there is no test case for passing defaultUri, which is unfortunate. Addressing the ugliness is not a real requirement, though, and my code generates elements that eventually go out on client or server streams with None as the namespace, and converts incoming stanzas such that the elements in the stream's default namespace go to None, too. This significantly eases the transition of stanzas that come from clients (jabber:client) to other servers (jabber:server).

The localPrefixes argument to Element is for storing prefix declarations on this particular element. This is useful in situations where you want to define a prefix for child elements or attributes. An example is the dialback namespace:

>>> stream = Element(('http://etherx.jabber.org/streams', 'stream'),
...                  'jabber:server',
...                  localPrefixes={'db': 'jabber:server:dialback'})
>>> stream.toXml()
u"<xn0:stream xmlns:xn0='http://etherx.jabber.org/streams' xmlns='jabber:server' xmlns:db='jabber:server:dialback'/>"

Here you see that because the element's namespace is different from the default namespace, a prefix is needed and generated for us. To supply a prefix explicitly we can do:

>>> stream.toXml(prefixes={'http://etherx.jabber.org/streams': 'stream'})
u"<stream:stream xmlns:stream='http://etherx.jabber.org/streams' xmlns='jabber:server' xmlns:db='jabber:server:dialback'/>"

To now make the actual dialback exchange use prefix notation, we do something like this in the send method of XmlStream:

>>> result = Element(('jabber:server:dialback', 'result'))
>>> prefixes = {'jabber:server:dialback': 'db'}
>>> result.toXml(prefixes=prefixes, prefixesInScope=prefixes.values())
u'<db:result/>'

A similar test (perhaps the same one?) wants the child of this to be unprefixed. But if there was no default namespace for the first stanza, why has it changed for the child?

I think you are referring to testLocalPrefixesWithChild, and there <baz/> does not have a namespace or prefix because the default namespace of the parent element is the empty namespace ('').

  2009-01-28 19:43:12+00:00 changed by radix

  • keywords set to review

I'm putting this ticket into review because there is (what seems to be) a thorough justification in the comments of this ticket. Someone should thoroughly review it.

  2009-01-28 19:43:32+00:00 changed by radix

  • owner deleted

  2009-03-20 11:52:50+00:00 changed by glyph

  • keywords deleted
  • owner set to ralphm

OK. This makes a lot more sense to me now - or at least, the Twisted implementation does. Why the XMPP standardization effort would go to such deliberate and meticulous effort to destroy any possible value they got out of using XML, an existing "standard", by requiring all kinds of non-standard behavior, still escapes me.

The one thing I think is still questionable is the custom DOM implementation. Everything I've seen here seems like it could be handled by a custom serializer working from microdom, minidom, or ElementTree - or heck, eventually, all 3.

My suggestion for handling the namespace wonkiness would be to create a specific internal-to-twisted.words namespace, let's say http://twistedmatrix.com/ns/twisted.words/xmpp-no-namespace, that would be treated specially by the custom serializer, associating with the containing namespace.

In fact, with such a namespace and a little glue code, it seems to me that we wouldn't even need a custom serializer, just some code that was smart about working with Element objects representing DOM fragments, and walking over them to properly normalize namespaces when inserting one into another.
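A rough sketch of that glue code, using ElementTree objects, might look like the following. The sentinel URI is the one suggested above; the helper names (split_tag, adopt, _normalize) are hypothetical, and the sketch makes one loud simplification: it treats the parent element's own namespace as the "containing" namespace, since ElementTree does not record which default namespace is in scope.

```python
import xml.etree.ElementTree as ET

# Sentinel proposed in this comment; elements in it mean "inherit from container".
NO_NAMESPACE = "http://twistedmatrix.com/ns/twisted.words/xmpp-no-namespace"


def split_tag(tag):
    """Split ElementTree's '{uri}local' form into (uri, local)."""
    if tag.startswith("{"):
        uri, _, local = tag[1:].partition("}")
        return uri, local
    return None, tag


def _normalize(elem, containing_uri):
    """Rebind sentinel-namespace elements to the containing namespace, recursively."""
    uri, local = split_tag(elem.tag)
    if uri == NO_NAMESPACE:
        uri = containing_uri
        elem.tag = "{%s}%s" % (uri, local) if uri else local
    for child in elem:
        _normalize(child, uri)


def adopt(parent, fragment):
    """Insert a fragment under parent, normalizing its namespaces first."""
    # Assumption: the parent's own namespace stands in for the default
    # namespace in effect at the insertion point.
    parent_uri, _ = split_tag(parent.tag)
    _normalize(fragment, parent_uri)
    parent.append(fragment)


stanza = ET.Element("{jabber:server}message")
body = ET.Element("{%s}body" % NO_NAMESPACE)
adopt(stanza, body)
print(body.tag)
```

Elements that already carry a real namespace are left alone, so only the parts of a fragment that were deliberately left unqualified follow it into the new stream.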

I therefore propose that we close this ticket and file another one for deleting the DOM implementation in xish.

Thoughts?
