Opened 12 years ago

Last modified 5 years ago

#3561 task new

deprecate microdom

Reported by: Jean-Paul Calderone Owned by:
Priority: normal Milestone:
Component: web Keywords:
Cc: Glyph Branch:
Author:

Description

In the olden days, there were lots of XML parsers, and they were bad. Now there's even more XML parsers, and they're not so bad. But microdom is still bad. Now it's probably worse than most other XML libraries. Plus, we're not going to fix the obvious bugs in it (eg #414). People shouldn't use it, so it should be deprecated.

twisted.web.html uses microdom, so it should also be deprecated probably (it's not an intensely useful module anyway). Anything else in Twisted which uses microdom (hopefully, not many things) should stop or also be deprecated (some things in domhelpers use it, for example).

Change History (16)

comment:1 Changed 12 years ago by Jean-Paul Calderone

Cc: Glyph added

Glyph suggested preserving the potentially still useful lenient parsing mode. We should take a look at other libraries for that to see if microdom is still the best. If it is, we can preserve it by offering a new API solely for that purpose that uses sux and some parts of microdom but which uses the DOM classes from minidom instead of the microdom DOM classes.

One other library for this is tidy. Glyph, do you know of any examples where tidy falls over but microdom does alright?

comment:2 Changed 12 years ago by Glyph

Tidy does something slightly different.

Consider the example on the site for the Python bindings to libtidy, utidylib. It doesn't just leniently parse some XML and give you a DOM structure: it parses it and reformats it as a valid HTML document (string), complete with doctype. If you want a DOM, you have to parse it again anyway.

Microdom's extremely lenient parsing mode can be useful for parsing other kinds of ill-formed or mangled XML documents, or inspecting almost-but-not-quite valid XHTML without completely rewriting it.

Consider also this curious declaration from python-utidylib's author: "There are no plans to support other features of TidyLib, such as document-tree traversal, since Python has several quality DOM implementations. (The author uses Twisted's implementation, twisted.web.microdom)." This bolsters my very vague impression that microdom is still valuable to somebody for something and maybe we should keep maintaining it, but I am really not current on any of the XML issues in the Python world.

I guess my main fear is that if we get rid of microdom I will suddenly start learning things about deficiencies of other peoples' DOM implementations again, and I won't like what I find out :).

That said I have no good, rational reason to oppose this decision, and I am in fact in favor of it, since it would be a waste of time to maintain an XML library when other people are doing this better. The situation has stabilized somewhat from the time of microdom's origins (the python development team breaks stuff other than XML APIs now). I'm just not really sure about it. Maybe some day this week I'll work up the energy to go and compare its current state to minidom again.

comment:3 Changed 12 years ago by jknight

Microdom is not the best, in fact I'd say its lenient mode is really rather poor, especially if you attempt to use it on HTML, which is what most people want to do. The state-of-the art for HTML python parsers is html5lib. http://code.google.com/p/html5lib/

Despite its name, it does have a kindasortamostly-XML mode too, see the docs.

comment:4 Changed 12 years ago by Glyph

Counterpoint: it's totally the best. What criteria are we using to evaluate "bestness"?

HTML5Lib looks like a much better library than Tidy in general, but its function is roughly the same:

>>> import html5lib
>>> p = html5lib.HTMLParser()
>>> p.parse("<html>Hi!")
<<class 'html5lib.treebuilders.simpletree.Document'> None>
>>> _.toxml()
'<html><head/><body>Hi!</body></html>'

This is helpful if you're trying to get from an invalid HTML document to a valid document, but less helpful if you're trying to get from a non-well-formed XML document to a well-formed one.

Since the uses for non-HTML XML are pretty obscure (only an XMPP fan would tell you otherwise :)) the more specific use-case I have in mind is sanitizing a fragment of an (X)HTML document.

Microdom might not be ideal for this use-case, either, but it is at least obvious how to use it. I could easily be convinced that some usage of html5lib would be better for this; in fact, I already suspect it is the case. However, I use software that uses microdom to parse and display aggressively terrible HTML fragments such as spam message bodies about a dozen times a day, and have been for over a year at this point. I don't think that it's "really rather poor", and given the fairly low maintenance overhead of microdom I'm not inclined to rewrite all that software for some marginal improvement. Giving pointers to other libraries that do similar stuff (and may be better for your use-cases) might be a better idea than just dropping it.

comment:5 Changed 12 years ago by Glyph

Whoops! Closing my browser, I noticed that the non-specific "see the documentation" link lead me here: http://wiki.whatwg.org/wiki/HTML5Lib but I think you actually meant here: http://code.google.com/p/html5lib/wiki/UserDocumentation

Looks like this is actually what I wanted to do:

>>> x = html5lib.XMLParser()
>>> x.parse("<stuff>hi")
<<class 'html5lib.treebuilders.simpletree.Document'> None>
>>> _.toxml()
'<stuff>hi</stuff>'

comment:6 Changed 12 years ago by Glyph

One point in sux's favor is that it can parse a stream leniently and generate events as it is doing so. Per html5lib's documentation: "The WHATWG spec is not very streaming-friendly as it requires rearrangement of subtrees in some situations. However html5lib allows SAX events to be created from a DOM tree using html5lib.treebuilders.dom.dom2sax."

Of course this doesn't really help microdom at all, since microdom is a dom :).

comment:7 Changed 12 years ago by Glyph

Another point in microdom's favor, that JP discovered:

>>> from xml.dom.minidom import parseString
>>> parseString("<hi>hello<hi></hi>")
 ...
xml.parsers.expat.ExpatError: no element found: line 1, column 18
>>> from twisted.web.microdom import parseString
>>> parseString("<hi>hello<hi></hi>")
 ...
twisted.web.microdom.MismatchedTags: expected </<hi ('<xmlfile />' line 1 column 4) >...</hi>>, got </END_OF_FILE> line: 1 col: 18, began line: 1 col: 4
>>> 

Notice that microdom tells you a lot more: there are mismatched tags, not just "no element", and where the mismatched tag began. This information can be really helpful when editing XML documents by hand.

comment:8 Changed 11 years ago by Glyph

Milestone: Twisted-8.2+1
Priority: highestnormal

I don't think this should block the release, and I don't think it's actually particularly high priority.

(Also I have some other thoughts, which I will attach momentarily.)

comment:9 Changed 11 years ago by Glyph

I'd like to summarize the consequences of the issues reported above so that this deprecation does not become a regression.

Right now, it's fairly convenient to write a twisted.web application which parses some XML from a "REST" API and processes it, without unexpectedly opening and blocking on random sockets that grab DTDs, and with helpful error-reporting of malformed XML, and/or optional lenient acceptance of malformed XML.

I think this is a pretty common use-case. Deprecating microdom right now will make it more difficult, since the default behavior of microdom's parser is to grab DTDs from random places, and turning that off is difficult. Now, one could argue that lenient XML parsing is not particularly valuable, but the parsing ought to be a separable concern.

Unfortunately there are actually two parsers: twisted.web.sux.XMLParser, and twisted.web.microdom.MicroDOMParser - which would be better titled CruddyXMLParser and CruddyHTMLParser. Even more unfortunately, some of the features in MicroDOMParser - for example, shouldPreserveSpace - really belong in XMLParser.

I think that James makes a compelling case that html5lib is really where it's at for parsing and scrubbing crummy HTML; in the long term, I think we can safely abandon that particular use-case and leave it up to the generally better-qualified html5lib folks.

That leaves the actual DOM implementation in twisted.web.microdom. 90% of this is a trivial re-implementation of a bunch of garbage that microdom handles particularly nicely. The one thing that I can see which might have some value is Element.writexml.

The use-case for that code is this: you've grabbed some XML from some API or XMPP stream or whatever, and you've got a DOM node, which you want to stick into an XHTML document. As described in the ever-informative "XHTML considered harmful", you are sometimes stuck in the situation of generating a document which you (the generator) knows will be valid XHTML, but you aren't necessarily going to have the appropriate help from your web server to set the MIME type properly, so it's going to be processed with an SGML parser. So everything will line up nicely and even operate in "standards compliance" mode (ha ha ha) as long as you obey the "ALLOWSINGLETON" list and don't generate anything that looks like e.g. "<span />". I realize this use-case is really crappy, but it's an ugly reality I've had to deal with a number of times, and I'd hate to lose that code. (Unless there's an obviously, uniformly better way to deal with this, of course.)

So, with those goals outlined, here are the steps I think should be taken:

  1. Deprecate twisted.web.microdom.lmx. Holy crap, I hope nobody is using that. I think we can all agree it's worthless.
  2. Move the invalid-XML parsing features from MicroDOMParser into XMLParser.
  3. Make XMLParser use a pluggable DOM so that we can do our crappy semi-XML parsing into xml.dom.minidom objects rather than having our own DOM.
  4. Create a new module, twisted.web.parsexml, to hold stuff related to parsing XML. (Maybe all of this stuff should just go into sux, though. The interesting point is that it all needs to get out of microdom.)
  5. Move twisted.lore.tree.parseFileAndReport into twisted.web.parsexml so that you can use the standard library, standards-compliant XML parser with good error-reporting and not-blocking-on-DTD-sockets behavior in twisted.web applications.
  6. If my rationale for the XHTML-and-HTML use-case above is valid, rewrite twisted.web.microdom.Element.writexml as a free function that operates on standard DOM objects; put it into twisted.web.parsexml.
  7. Move microdom.MicroDOMParser to parsexml.TagSoupParser or similar, to make it clear what its utility is, and deprecate the old name.
  8. Deprecate everything remaining in microdom (i.e. Element, Document... the actual DOM classes).

comment:10 Changed 11 years ago by Jean-Paul Calderone

since the default behavior of microdom's parser is to grab DTDs from random places

This is the default behavior of xml.sax, not xml.dom. Someone running xml.dom.minidom.parseString(foo) is at no risk of unexpected network access in any existing version of Python, as far as I know.

comment:11 Changed 11 years ago by jknight

The one thing that I can see which might have some value is Element.writexml

Serialization of HTML/XHTML is also handled by html5lib. See "Serialization of Streams" in the docs. Here's a simple modification of the example which parses the text as a fragment, and serializes as XHTML (with proper html compatibility):

import html5lib
from html5lib import treebuilders, treewalkers, serializer
from html5lib.filters import sanitizer
#
p = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
walker = treewalkers.getTreeWalker("dom") 
s = serializer.xhtmlserializer.XHTMLSerializer(omit_optional_tags=True)
#
dom_tree = p.parseFragment('<a target="foo"></a><img src="foo">')
''.join(s.serialize(walker(dom_tree)))

Results in: u'<a target="foo"></a><img src="foo" />'

comment:12 in reply to:  10 Changed 11 years ago by Glyph

Replying to exarkun:

since the default behavior of microdom's parser is to grab DTDs from random places

This is the default behavior of xml.sax, not xml.dom.

Just for future reference, it appears that the offending code is in xml.dom.xmlbuilder.DOMEntityResolver.resolveEntity and xml.sax.saxutils.prepare_input_source - it's a bit hard to trace through the code, so I'm not sure which one is invoked in which circumstance. In other words, as far as I can tell, it's not a difference between xml.sax and xml.dom, so much as a difference between the default expatbuilder and the more general xmlbuilder.

Someone running xml.dom.minidom.parseString(foo) is at no risk of unexpected network access in any existing version of Python, as far as I know.

Interesting. This seems to be true as long as you don't pass a parser object ;-). In that case I guess the thing we need is something that can create a parser object like the one created in parseFileAndReport?

I'm curious now why is it that you went with xml.sax for parsing in lore rather than subclassing ExpatBuilderNS for the error reporting or something.

comment:13 Changed 9 years ago by <automation>

Owner: jknight deleted

comment:14 Changed 5 years ago by Adi Roiban

Resolution: duplicate
Status: newclosed

I think that this was fixed in #6079 as microdom is no longer used

comment:15 Changed 5 years ago by Adi Roiban

Resolution: duplicate
Status: closedreopened

my bad... microdom is not yet deprecated

comment:16 Changed 5 years ago by Adi Roiban

Status: reopenednew

I think that before deprecating it we need to migrate current twisted code away from microdom. I have created a separate ticket for that #7943

Note: See TracTickets for help on using tickets.