Idea: migrate from {lxml, html-text} to {html5lib, bleach} #163
Thanks for the PR @jayaddison, some notes: there are also some dependencies which require `lxml`, e.g. `html-text` and probably some others - although `html-text` can be made optional, and maybe the syntaxes you care about are not affected.
From limited tests I did, html5lib is ~10 times slower than lxml.html (or other fast parsers), and parsing is a significant portion of extruct runtime, so dropping lxml support would have too much of a performance impact for performance-sensitive users (e.g. we run extruct on high volumes of data, and probably others do as well). On the other hand, having it optional would be fine I think - although I'm not sure how feasible that is, given that some of our dependencies also need it.
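A rough harness along these lines can reproduce that kind of per-parse timing comparison. This sketch uses only the stdlib `html.parser` as a stand-in so it stays self-contained; swapping `lxml.html.fromstring` or `html5lib.parse` into `benchmark` (assuming those packages are installed) gives the actual comparison, and the numbers here are illustrative only.

```python
"""Micro-benchmark sketch for comparing HTML parser throughput."""
import timeit
from html.parser import HTMLParser

# A small, repetitive document so each parse does non-trivial work.
SAMPLE = "<html><body>" + "<p>hello <b>world</b></p>" * 200 + "</body></html>"


class CountingParser(HTMLParser):
    """Minimal parser that just counts start tags, so parsing isn't optimized away."""

    def __init__(self):
        super().__init__()
        self.tags = 0

    def handle_starttag(self, tag, attrs):
        self.tags += 1


def parse_with_stdlib(markup: str) -> int:
    parser = CountingParser()
    parser.feed(markup)
    parser.close()
    return parser.tags


def benchmark(parse, markup, repeats=5, number=50):
    """Return the best observed per-parse time in seconds."""
    times = timeit.repeat(lambda: parse(markup), repeat=repeats, number=number)
    return min(times) / number


if __name__ == "__main__":
    best = benchmark(parse_with_stdlib, SAMPLE)
    print(f"stdlib html.parser: {best * 1e3:.3f} ms per parse")
```

To compare candidates, pass each parser's entry point as `parse` and run on a representative corpus rather than a synthetic sample.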
My understanding is that the goal of cleaning was not to ensure safety, but to produce cleaner output, although I'm not 100% sure.
Thanks @lopuhin, that's good to confirm that performance could be a blocker for this.
Hopefully those other dependencies are no longer required.
I did notice that there are already some abstractions over XML nodes/elements in the codebase.
That makes sense; extracting text is likely the primary concern for most users. That said, changes to content handling in libraries have a strange way of affecting downstream applications, and developers of those would want to be aware of and prepared for a potentially risky change like this. I think it's unlikely that one sanitization library would be conclusively 'better' or 'worse' in all situations: more likely, different situations would be handled better by each one. Perhaps there are some good sources of test cases out there; I'll try to look into that soon.
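As a concrete illustration of why sanitizers can disagree, here's a toy allowlist-style cleaner built on the stdlib `HTMLParser`. It is a hypothetical stand-in, not the actual behavior of bleach or lxml's `Cleaner`; malformed input such as an unclosed `<script>` is exactly the kind of case where real libraries make different choices.

```python
from html.parser import HTMLParser


class AllowlistStripper(HTMLParser):
    """Toy sanitizer: keep text content, but drop everything inside
    disallowed elements such as <script>/<style>."""

    DROP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._drop_depth = 0  # >0 while inside a disallowed element

    def handle_starttag(self, tag, attrs):
        if tag in self.DROP:
            self._drop_depth += 1

    def handle_endtag(self, tag):
        if tag in self.DROP and self._drop_depth:
            self._drop_depth -= 1

    def handle_data(self, data):
        if not self._drop_depth:
            self.out.append(data)


def clean_text(markup: str) -> str:
    parser = AllowlistStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Note that an unclosed `<script>` leaves this cleaner dropping everything to end-of-input; another sanitizer might instead close the element at the next block boundary, which is precisely the kind of divergence worth covering with shared test cases.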
Closing this pull request for now; it can be reopened if and when significant performance improvements become available.
Although I'm not planning another attempt at this migration soon, it does seem to me like a path worth exploring. I haven't attempted to use it for anything other than basic JSON-LD XPath queries, so I don't have a lot of information to share about its compatibility and performance compared to other interfaces.
See also: https://github.com/openculinary/extruct/tree/htmlement-parser for a work-in-progress branch using `htmlement`. I'll also admit that this hasn't been tested on any datasets larger than the provided unit tests -- nor benchmarked -- so it's very much experimental. I'm optimistic about the parsing performance but don't yet have any data to provide.
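Since htmlement (as I understand it) builds standard `xml.etree.ElementTree` elements, downstream extraction code can use the stdlib API directly. A minimal sketch of that query style, using `ET.fromstring` on a well-formed snippet as a stand-in for the HTML parser:

```python
import xml.etree.ElementTree as ET

# A well-formed stand-in document; a real HTML parser (e.g. htmlement)
# would produce an equivalent ElementTree from tag-soup HTML.
MARKUP = """
<html><head>
  <script type="application/ld+json">{"@type": "Recipe", "name": "Soup"}</script>
</head><body><p>Hello</p></body></html>
"""


def jsonld_payloads(root):
    """Collect raw JSON-LD payloads using ElementTree's limited XPath subset."""
    return [
        el.text
        for el in root.iterfind(".//script[@type='application/ld+json']")
    ]


root = ET.fromstring(MARKUP.strip())
payloads = jsonld_payloads(root)
```

ElementTree's path syntax covers only a subset of XPath 1.0, so queries that extruct currently expresses via lxml XPath may need rewriting or manual traversal.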
Thanks @jayaddison, that's really interesting, would love to hear about your progress 👍
Sure - glad to share a few findings.
It's hard to guess when I might next make progress on this, but I'll continue to share any updates and additional findings. |
@lopuhin I've managed to follow a couple of different paths to get the extraction working. Most property extraction seems more-or-less solved, except for a few remaining cases.

From experimentation with various approaches to text retrieval, it looks like the problem basically comes down to tree transform/traversal: emitting and combining the text tokens from the parse tree in an acceptable order. In the case of malformed HTML (ambiguous tree representation), I have a feeling that the tree orderings emitted by different parsers will diverge.

Either way, I have a feeling that most of the work-in-progress in the branch is going to sit unchanged for a while longer. I'll probably attempt another port next.

(and slightly off-topic: I've also been wondering whether full HTML parsing is strictly necessary to achieve W3C-specified extraction)
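The traversal problem described above (emitting and combining text tokens from the parse tree in an acceptable order) can be sketched over an ElementTree. In that model each element carries both `.text` (leading content) and `.tail` (content after its closing tag), and both must be interleaved to reproduce document order; the element names here are illustrative, not extruct's actual API.

```python
import xml.etree.ElementTree as ET

SKIP = {"script", "style"}  # elements whose text should not reach the output


def iter_text(el):
    """Yield text tokens in document order from an ElementTree node."""
    if el.tag in SKIP:
        # Content inside the element is dropped, but text after it is kept.
        if el.tail:
            yield el.tail
        return
    if el.text:
        yield el.text
    for child in el:
        yield from iter_text(child)
    if el.tail:
        yield el.tail


def extract_text(markup: str) -> str:
    root = ET.fromstring(markup)
    return "".join(iter_text(root)).strip()
```

Getting this interleaving wrong (e.g. forgetting `.tail`) silently drops or reorders text, which is consistent with the ordering differences described above for malformed input.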
I think I finally got close to the root (no pun intended) of the challenge with migrating text retrieval.

Basically, it's that the HTML5 spec indicates different ways to generate implied closing element tags depending on the parsing context within the input HTML. It's good that this is now well-specified; that said, parsers that don't implement those rules can build differently-shaped trees, and the text content retrieval test cases are affected by exactly those differences.

And I think that's where my work on this stops, for now :)
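The implied-end-tag divergence is easy to observe with the stdlib parser, which reports tags literally as written rather than synthesizing the end tags the HTML5 spec requires; an HTML5-compliant parser would instead close each `<li>` before the next one opens. A small sketch:

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Records raw start/end tag events exactly as they appear in the input."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def events_for(markup: str):
    parser = EventRecorder()
    parser.feed(markup)
    parser.close()
    return parser.events
```

For `"<ul><li>a<li>b</ul>"` this records two `("start", "li")` events and no `("end", "li")` at all, so any tree built naively from these events nests the second `<li>` inside the first, changing the order in which text tokens are later emitted.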
Thank you for the update @jayaddison 👍 My 2c about some potentially related work on our side - having extruct work on a tree parsed by an HTML5-compliant parser is something we'd also want. So far for HTML parsing we've been using https://html5-parser.readthedocs.io/en/latest/ which uses the gumbo parser under the hood and has an ability to emit trees in several formats.
This suggestion is derived from a specific use case where it'd be desirable to build pure-Python application containers for openculinary/crawler.

Currently a dependency on `lxml` is an obstacle to this, since `lxml` uses `libxml` and `libxslt`, and C bindings for those must be provided and compiled at build-time.

There are some practical issues to resolve:

- Waiting until a release of `rdflib` that includes "Discussion around new dict-based store implementation" (RDFLib/rdflib#1133) is available (in short, the default memory-backed RDF triple-store was not guaranteed order-preserving)

There are questions about the impact of the migration and whether it's worthwhile:

- What would the performance impact be for users who run `extruct` for high volumes of data?
- How does text cleaning with `bleach` and `bleach-extras` compare to `lxml`'s text cleaning functionality?

In short, it may not be worthwhile, but I spent a bit of time looking into the practicality of this and figured it's worth sharing even though it's work-in-progress.