-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expand previously-opened-element check to include all currently-open elements #4
Expand previously-opened-element check to include all currently-open elements #4
Conversation
tests/test_module.py
Outdated
assert "".join(body.itertext()) == "sample text content" | ||
|
||
def test_text_iterator_unclosed_tag(): | ||
html = "<html><body><span>hello <div>to <div>the <div>world!</span></body><footer>unrelated</footer></html>" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this test coverage is intentionally designed not to make any assumptions about the tree/node structure built in-memory by the parser.. because I'm not sure about the implications yet.
(for example: div<div<div>>>
(nested nodes) is one possible in-memory representation, and div<>div<>div<>
(sibling nodes) is another) -- either of those produce the same itertext
results in this test case, but it's possible to imagine representations where descendent text content would be different depending on the parse tree)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...and it turns out, the tree structure produced by lxml
when 'implied end tags' are generated depends on the elements encountered! (by design)
A series of unclosed div
elements will turn into nested nodes, whereas a series of unclosed li
items will result in sibling nodes.
It's all spec'd in the whatwg HTML5 spec, but it's kinda complicated involved (clarify: not complicated.. a lot of rules, though). Both the current node and the previous node are relevant when deciding how to emit implied closing tags, and there are custom rules for various different elements.
Uncertain what to do here - I like that this library is simple and fast. The reason I'm contributing is to see whether ScrapingHub's extruct
library could use htmlement
instead of lxml
, allowing me to build pure-Python containers for openculinary
.
Recent context here: scrapinghub/extruct#163 (comment) (and longer story earlier in that same issue)
In short: some of the test cases in extruct
relate to tree-order traversal of parsed HTML nodes for text content retrieval, which is what led me to all this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that is a bit of an issue alright. While Htmlement's approach is fast, it's also very naive, deliberately so.
I can add more logic to catch some of the more common closing tag issues, but it would never be fully compliant.
Hello. This is a good change, thanks. There shouldn't be much of a performance hit with this change, as it shouldn't be hit too often. I didn't plan for more than one malformed end tag, that was a bit short sighted on my part. Thanks again. |
You're welcome, glad to contribute - this seemed to make one small step towards migration away from As commented above, I'm not sure whether that migration is practical or straightforward after all -- not a fault of |
Could resolve #3 - also has the potential to reduce performance, though.