Opened 13 years ago

Closed 11 years ago

#681 defect closed (duplicate)

[PATCH] Make microdom preserve boolean of "was there whitespace or not"

Reported by: Tv
Owned by: spiv
Priority: high
Milestone:
Component: web
Keywords:
Cc: Cory Dodt, radix, Jean-Paul Calderone, spiv, itamarst, Tv, jknight, anglerud
Branch:
Author:

Description


Attachments (2)

microdom-whitespace-boolean.diff (502 bytes) - added by Tv 13 years ago.
microdom-whitespace-boolean-2.diff (686 bytes) - added by Tv 13 years ago.


Change History (8)

Changed 13 years ago by Tv

comment:1 Changed 13 years ago by Tv

Currently, parsing input such as
"foo &mdash; <span>bar</span>"
and serializing it again produces
"foo &mdash;<span>bar</span>"

Exactly _which_ whitespace was there isn't important, but the fact that there
_was_ whitespace is.
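
To make the idea concrete, here is a minimal sketch of "remember whether there was whitespace" when trimming a text node. It is not the attached patch and not microdom code; the helper name `trim_keeping_boundary` is hypothetical and only illustrates the behavior described above.

```python
def trim_keeping_boundary(text):
    """Trim a text node, but keep one space wherever whitespace was trimmed.

    Hypothetical helper for illustration; not part of microdom.
    """
    stripped = text.strip()
    if not stripped:
        # Whitespace-only (or empty) node: keep a single space only if
        # there was some whitespace to begin with.
        return " " if text else ""
    # Record the boolean "was there whitespace?" at each end.
    leading = " " if text[0].isspace() else ""
    trailing = " " if text[-1].isspace() else ""
    return leading + stripped + trailing


if __name__ == "__main__":
    # The text node before the <span> keeps its trailing space.
    print(repr(trim_keeping_boundary("foo &mdash;  \n")))
```

With this rule the text in front of the span keeps exactly one trailing space, so the serialized output still separates the entity from the following element.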

Changed 13 years ago by Tv

comment:2 Changed 13 years ago by Tv

In case you are interested, here's a less dramatic version that only activates
after entity references. I ended up using this because the original patch produced
spurious whitespace in places where microdom previously swallowed it.

comment:3 Changed 13 years ago by jknight

Hrm, I thought I had filed a bug on this before, but apparently not. Anyhow, the current behavior of
microdom is very broken: it is neither a proper XML parser nor a proper HTML parser.

To be a proper XML parser, it must never swallow whitespace entirely, only replace a run of whitespace
characters with a single space. (This is easy to implement, and will usually be sufficient when outputting
the parsed HTML to an HTML user agent.)
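
As a rough sketch of the "collapse, never swallow" rule just described, assuming it is applied to the raw text of each text node: the function name `collapse_whitespace` is hypothetical, and the character class is the set of characters XML treats as whitespace (space, tab, CR, LF).

```python
import re

# Runs of XML whitespace characters: space, tab, carriage return, line feed.
_WS_RUN = re.compile(r"[ \t\r\n]+")


def collapse_whitespace(text):
    """Replace every run of whitespace with one space; never drop a run."""
    return _WS_RUN.sub(" ", text)


if __name__ == "__main__":
    print(repr(collapse_whitespace("foo   &mdash;\n\t")))
```

Because a run is replaced rather than removed, whitespace at the edge of a text node survives as a single space next to the neighbouring tag.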

To be a proper HTML parser, it is allowed to swallow whitespace sometimes, depending on the tag it's
near (!), and otherwise coalesce it into one space, unless it's inside a pre or textarea, in which case
it must leave it alone. (This is harder to implement.)
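
A partial sketch of these HTML rules, using only the standard library's `html.parser` rather than microdom: it coalesces whitespace everywhere except inside pre and textarea. It deliberately does not attempt the tag-dependent swallowing, which is the hard part. The class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser

PRESERVE = {"pre", "textarea"}  # whitespace inside these is left alone
_WS_RUN = re.compile(r"[ \t\r\n]+")


class WhitespaceNormalizer(HTMLParser):
    """Re-serialize markup, coalescing whitespace outside PRESERVE tags."""

    def __init__(self):
        # Keep entity references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.preserve_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in PRESERVE:
            self.preserve_depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are echoed once, unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in PRESERVE and self.preserve_depth:
            self.preserve_depth -= 1
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.preserve_depth:
            data = _WS_RUN.sub(" ", data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def normalize_html(source):
    parser = WhitespaceNormalizer()
    parser.feed(source)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(normalize_html("foo &mdash; <span>bar</span>"))
```

Tracking a depth counter instead of a flag keeps nested preserved elements working; a real implementation would also need per-tag rules for where whitespace may be dropped.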

This bug almost got fixed at PyCon by implementing the XML parsing rules, but I guess lore depends on
the broken behavior? Or maybe it just depends on the HTML whitespace rules, in which case someone
could implement those.

If you want to see code that implements the HTML rules, see Perl's HTML::TreeBuilder and HTML::Tagset
modules. If people are serious about parsing HTML, we *really* need a victim to copy the algorithms
from Perl. Microdom is really a pretty poor HTML parser.

comment:4 Changed 13 years ago by hypatia

This bug looks like pretty much the same one as #571, so I've superseded #571 with
this one. Someone may want to have a look at the patch there, though.

comment:5 Changed 13 years ago by radix

jknight, you indeed did post a bug about this: #414.

This bug sucks :-(

comment:6 Changed 11 years ago by Jean-Paul Calderone

Resolution: duplicate
Status: new → closed

The action is going to be on #414 for this.
