Expand previously-opened-element check to include all currently-open elements #4

jayaddison · 2022-10-29T14:54:52Z

Could resolve #3 - also has the potential to reduce performance, though.

…urrent tag stack

jayaddison · 2022-10-29T21:42:37Z

tests/test_module.py

+    assert "".join(body.itertext()) == "sample text content"
+
+def test_text_iterator_unclosed_tag():
+    html = "<html><body><span>hello <div>to <div>the <div>world!</span></body><footer>unrelated</footer></html>"


Note that this test coverage is intentionally designed not to make any assumptions about the tree/node structure built in-memory by the parser.. because I'm not sure about the implications yet.

(for example: div<div<div>>> (nested nodes) is one possible in-memory representation, and div<>div<>div<> (sibling nodes) is another) -- either of those produce the same itertext results in this test case, but it's possible to imagine representations where descendent text content would be different depending on the parse tree)

...and it turns out, the tree structure produced by lxml when 'implied end tags' are generated depends on the elements encountered! (by design)

A series of unclosed div elements will turn into nested nodes, whereas a series of unclosed li items will result in sibling nodes.

It's all spec'd in the whatwg HTML5 spec, but it's kinda ~~complicated~~ involved (clarify: not complicated.. a lot of rules, though). Both the current node and the previous node are relevant when deciding how to emit implied closing tags, and there are custom rules for various different elements.

Uncertain what to do here - I like that this library is simple and fast. The reason I'm contributing is to see whether ScrapingHub's extruct library could use htmlement instead of lxml, allowing me to build pure-Python containers for openculinary.

Recent context here: scrapinghub/extruct#163 (comment) (and longer story earlier in that same issue)

In short: some of the test cases in extruct relate to tree-order traversal of parsed HTML nodes for text content retrieval, which is what led me to all this.

Yeah that is a bit of an issue alright. While Htmlement's approach is fast, it's also very naive, deliberately so.
I can add more logic to catch some of the more common closing tag issues, but it would never be fully compliant.

…e case

willforde · 2022-10-31T00:04:31Z

Hello. This is a good change, thanks. There shouldn't be much of a performance hit with this change, as it shouldn't be hit too often.

I didn't plan for more than one malformed end tag, that was a bit short sighted on my part. Thanks again.

jayaddison · 2022-10-31T00:25:29Z

You're welcome, glad to contribute - this seemed to make one small step towards migration away from lxml in extruct (a library that openculinary uses, and that I'd like to see freed of non-Python dependencies).

As commented above, I'm not sure whether that migration is practical or straightforward after all -- not a fault of htmlement - seems tricky to get the text fragment ordering logic for HTML5 correct.

jayaddison added 3 commits October 29, 2022 15:53

Add test coverage for existing expected behaviour

5f8a0c8

Add illustrative test case for unrelated-text collection issue

89e3170

Extend lookback for unclosed element tags to the cover the complete c…

a5c7867

…urrent tag stack

jayaddison commented Oct 29, 2022

View reviewed changes

Fixup: span (inline) within div (block) makes more sense as an exampl…

d8e4c43

…e case

willforde merged commit 3af84f6 into willforde:master Oct 31, 2022

jayaddison deleted the issue-3/previously-closed-tag-check branch October 31, 2022 10:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Expand previously-opened-element check to include all currently-open elements #4

Expand previously-opened-element check to include all currently-open elements #4

jayaddison commented Oct 29, 2022

jayaddison Oct 29, 2022

jayaddison Oct 30, 2022 •

edited

Loading

willforde Oct 31, 2022

willforde commented Oct 31, 2022

jayaddison commented Oct 31, 2022

Expand previously-opened-element check to include all currently-open elements #4

Expand previously-opened-element check to include all currently-open elements #4

Conversation

jayaddison commented Oct 29, 2022

jayaddison Oct 29, 2022

Choose a reason for hiding this comment

jayaddison Oct 30, 2022 • edited Loading

Choose a reason for hiding this comment

willforde Oct 31, 2022

Choose a reason for hiding this comment

willforde commented Oct 31, 2022

jayaddison commented Oct 31, 2022

jayaddison Oct 30, 2022 •

edited

Loading