Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expand previously-opened-element check to include all currently-open elements #4

Merged
merged 4 commits into from
Oct 31, 2022
Merged

Expand previously-opened-element check to include all currently-open elements #4

merged 4 commits into from
Oct 31, 2022

Conversation

jayaddison
Copy link

Could resolve #3 - also has the potential to reduce performance, though.

assert "".join(body.itertext()) == "sample text content"

def test_text_iterator_unclosed_tag():
html = "<html><body><span>hello <div>to <div>the <div>world!</span></body><footer>unrelated</footer></html>"
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this test coverage is intentionally designed not to make any assumptions about the tree/node structure built in-memory by the parser.. because I'm not sure about the implications yet.

(for example: div<div<div>>> (nested nodes) is one possible in-memory representation, and div<>div<>div<> (sibling nodes) is another) -- either of those produce the same itertext results in this test case, but it's possible to imagine representations where descendent text content would be different depending on the parse tree)

Copy link
Author

@jayaddison jayaddison Oct 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...and it turns out, the tree structure produced by lxml when 'implied end tags' are generated depends on the elements encountered! (by design)

A series of unclosed div elements will turn into nested nodes, whereas a series of unclosed li items will result in sibling nodes.

It's all spec'd in the whatwg HTML5 spec, but it's kinda complicated involved (clarify: not complicated.. a lot of rules, though). Both the current node and the previous node are relevant when deciding how to emit implied closing tags, and there are custom rules for various different elements.

Uncertain what to do here - I like that this library is simple and fast. The reason I'm contributing is to see whether ScrapingHub's extruct library could use htmlement instead of lxml, allowing me to build pure-Python containers for openculinary.

Recent context here: scrapinghub/extruct#163 (comment) (and longer story earlier in that same issue)

In short: some of the test cases in extruct relate to tree-order traversal of parsed HTML nodes for text content retrieval, which is what led me to all this.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that is a bit of an issue alright. While Htmlement's approach is fast, it's also very naive, deliberately so.
I can add more logic to catch some of the more common closing tag issues, but it would never be fully compliant.

@willforde
Copy link
Owner

Hello. This is a good change, thanks. There shouldn't be much of a performance hit with this change, as it shouldn't be hit too often.

I didn't plan for more than one malformed end tag, that was a bit short sighted on my part. Thanks again.

@jayaddison
Copy link
Author

You're welcome, glad to contribute - this seemed to make one small step towards migration away from lxml in extruct (a library that openculinary uses, and that I'd like to see freed of non-Python dependencies).

As commented above, I'm not sure whether that migration is practical or straightforward after all -- not a fault of htmlement - seems tricky to get the text fragment ordering logic for HTML5 correct.

@willforde willforde merged commit 3af84f6 into willforde:master Oct 31, 2022
@jayaddison jayaddison deleted the issue-3/previously-closed-tag-check branch October 31, 2022 10:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Potential bug: implicitly-closed tag lookup only considers a single previous element
2 participants