[Twisted-Python] Re: Contributing?
James Y Knight
foom at fuhm.net
Thu Aug 26 15:07:35 EDT 2004
On Aug 26, 2004, at 2:15 PM, Nicola Larosa wrote:
>> There are a variety of other Python HTML parsers, but from what I can
>> tell, they're even worse than microdom is. It'd be way cool to have a
>> python HTML parser that actually works.
> People say nice things about Beautiful Soup:
Unfortunately, it's trying to solve a completely different problem. It
is not trying to build a tree of the entire document, but rather to
do something like "give me all the hrefs on the page". As such, it
doesn't even *try* to parse HTML properly; it just knows enough to be
able to ignore the parts of the page you aren't asking for.
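To make the "give me all the hrefs" style of processing concrete, here is a
minimal sketch. It is not Beautiful Soup's code; it uses the html.parser
module from a present-day Python standard library, and the names
HrefCollector and collect_hrefs are made up for illustration. It looks at
start tags only and never builds a tree, so unclosed or badly nested
elements don't bother it at all.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: gather href values, ignoring document structure."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Only start tags are inspected; end tags and nesting are ignored,
        # so no tree is ever built.
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


def collect_hrefs(markup):
    """Return every href attribute value found in the markup, in order."""
    collector = HrefCollector()
    collector.feed(markup)
    collector.close()
    return collector.hrefs
```

Feeding it something sloppy like '<p><a href="a.html">x<b><a href="b.html">y</p>'
still yields both links, which is exactly why this approach says nothing
about whether the document was parsed correctly.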
Its intro says:
> A well-formed HTML document will yield a well-formed data
> structure. An ill-formed HTML document will yield a correspondingly
> ill-formed data structure. If your document is only locally
> well-formed, you can use this to process the well-formed part of it.
However, that is not entirely accurate, unless "well formed" doesn't
mean "follows the HTML4 standard". It doesn't parse
"<table><tr><td>foo<tr><td>bar</table>" correctly -- a perfectly valid
bit of HTML4. Microdom's goal is to yield a well-formed data structure
from a well-formed HTML document, and most ill-formed HTML documents
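For the table example above, the point is that HTML4 lets you omit the end
tags of TR and TD, so a tree-building parser has to close them implicitly.
The following is a rough sketch of that one rule, not microdom's actual
code. It assumes html.parser from a present-day Python standard library;
IMPLIED_CLOSE, TreeBuilder and parse are hypothetical names, and the rule
table is deliberately incomplete.

```python
from html.parser import HTMLParser

# Assumption: a tiny, incomplete subset of HTML4's optional end tags.
# Opening the key element implicitly closes an open element in the value set.
IMPLIED_CLOSE = {
    "tr": {"tr", "td", "th"},
    "td": {"td", "th"},
    "th": {"td", "th"},
}


class TreeBuilder(HTMLParser):
    """Build (tag, children) tuples, closing TR/TD/TH implicitly.

    Limitations of this sketch: void elements (e.g. BR), nested tables and
    the implied TBODY element are not handled.
    """

    def __init__(self):
        super().__init__()
        self.root = ("#root", [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Pop every open element that this start tag implicitly closes.
        closes = IMPLIED_CLOSE.get(tag, ())
        while len(self.stack) > 1 and self.stack[-1][0] in closes:
            self.stack.pop()
        node = (tag, [])
        self.stack[-1][1].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Close up to and including the nearest matching open element;
        # an end tag with no matching open element is ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1][1].append(data)


def parse(markup):
    """Return the root node of the tree built from the markup."""
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root
```

With this, parse("<table><tr><td>foo<tr><td>bar</table>") gives a table
with two sibling rows, each holding one cell, rather than a second row
nested inside the first cell. A real HTML4 parser would also need to
supply the implied TBODY, which this sketch skips.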