[Twisted-Python] Re: Contributing?

James Y Knight foom at fuhm.net
Thu Aug 26 15:07:35 EDT 2004


On Aug 26, 2004, at 2:15 PM, Nicola Larosa wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
>> There are a variety of other Python HTML parsers, but from what I can
>> tell, they're even worse than microdom is. It'd be way cool to have a
>> python HTML parser that actually works.
>
> People say nice things about Beautiful Soup:
>
> http://www.crummy.com/software/BeautifulSoup/

Unfortunately, it's trying to solve a completely different problem. It 
is not trying to build a tree of the entire document, but rather to do 
something like "give me all the hrefs on the page". As such, it 
doesn't even *try* to parse HTML properly; it just knows enough to be 
able to ignore the parts of the page you aren't asking for.

Its intro says:
> A well-formed HTML document will yield a well-formed data
> structure. An ill-formed HTML document will yield a correspondingly
> ill-formed data structure. If your document is only locally
> well-formed, you can use this to process the well-formed part of it.

However, that is not entirely accurate, unless "well-formed" doesn't 
mean "follows the HTML4 standard". It doesn't parse 
"<table><tr><td>foo<tr><td>bar</table>" correctly -- a perfectly valid 
bit of HTML4, since the end tags for TR and TD are optional there. 
Microdom's goal is to yield a well-formed data structure from a 
well-formed HTML document, and from most ill-formed HTML documents 
too.
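
To make the table example concrete, here is a minimal sketch of the 
implied-end-tag handling it needs. This is not microdom's code and not 
Beautiful Soup's; it is a hypothetical tree builder on top of the 
standard library's html.parser, and the names IMPLIED_CLOSE, Node, 
TreeBuilder and parse are made up for illustration. It only knows the 
TR/TD/TH rules and ignores the implied TBODY that HTML4 also allows.

```python
from html.parser import HTMLParser

# Assumption: a tiny subset of HTML4's optional end-tag rules, just
# enough for the table example. Maps a start tag to the set of open
# elements that it implicitly closes.
IMPLIED_CLOSE = {
    "tr": {"td", "th", "tr"},
    "td": {"td", "th"},
    "th": {"td", "th"},
}


class Node:
    """One element in the tree; children are Nodes or text strings."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []

    def to_string(self):
        inner = "".join(
            c if isinstance(c, str) else c.to_string()
            for c in self.children
        )
        return "<%s>%s</%s>" % (self.tag, inner, self.tag)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # Close any open elements that this start tag implicitly ends.
        closes = IMPLIED_CLOSE.get(tag, set())
        while self.current.tag in closes:
            self.current = self.current.parent
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Walk up to the matching open element, closing everything
        # inside it; a stray end tag with no match is ignored.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def parse(html):
    """Return the list of top-level nodes parsed from html."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root.children
```

Calling parse("<table><tr><td>foo<tr><td>bar</table>")[0].to_string() 
gives "<table><tr><td>foo</td></tr><tr><td>bar</td></tr></table>", 
i.e. two sibling rows rather than a row nested inside a cell. A real 
parser would need the full set of rules from the HTML4 DTD, not just 
these three.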

James




