
Idea: migrate from {lxml, html-text} to {html5lib, bleach} #163

Closed
wants to merge 12 commits into from

Conversation

jayaddison
Contributor

This suggestion is derived from a specific use case where it'd be desirable to build pure-Python application containers for openculinary/crawler.

Currently, a dependency on lxml is an obstacle to this, since lxml binds to the libxml2 and libxslt C libraries, and those bindings must be provided and compiled at build time.

There are some practical issues to resolve:

  • Whitespace-related tests for W3C microdata are failing
  • Node-ordering-related tests for RDFa parsing are failing; this should be resolved once an upstream release of rdflib that includes the new dict-based store implementation (RDFLib/rdflib#1133) is available (in short, the default memory-backed RDF triple store was not guaranteed to preserve order)
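For context, the dict-based store change matters because Python dicts preserve insertion order (guaranteed since Python 3.7), which the older default store did not rely on. A minimal sketch of the property involved (illustrative only, not rdflib's actual implementation):

```python
# Python 3.7+ dicts iterate in insertion order, so a dict-keyed triple
# store naturally yields triples back in the order they were added.
triples = [
    ("s", "p", "apple"),
    ("s", "p", "banana"),
    ("s", "p", "cherry"),
]

store = {}  # stand-in for a dict-based triple store
for t in triples:
    store[t] = True

ordered = list(store)  # same order as insertion
```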

There are questions about the impact of the migration and whether it's worthwhile:

  • What is the performance impact of the migration, particularly for any customers using extruct for high volumes of data?
  • What are the comparative safety properties of bleach and bleach-extras compared to lxml's text cleaning functionality?

In short, it may not be worthwhile, but I spent a bit of time looking into the practicality of this and figured it was worth sharing even though it's a work in progress.

@lopuhin
Member

lopuhin commented Dec 12, 2020

Thanks for the PR @jayaddison, some notes:

There are also some dependencies that require lxml, e.g. html-text and probably some others - although html-text can be made optional, and maybe the syntaxes you care about are not affected.

What is the performance impact of the migration, particularly for any customers using extruct for high volumes of data?

From limited tests I did, html5lib is ~10 times slower than lxml.html (or other fast parsers), and parsing is a significant portion of extruct's runtime, so dropping lxml support would have too much of an impact on performance-sensitive users (e.g. we run extruct on high volumes of data, and probably others do as well). On the other hand, having it as an optional parser would be fine, I think - although I'm not sure how feasible that is, given that some of our dependencies also need lxml.
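For anyone wanting to reproduce that kind of comparison, a rough micro-benchmark harness might look like the sketch below. The stdlib `html.parser` is used as a stand-in parse function here; `html5lib.parse` or `lxml.html.fromstring` (both real entry points) could be swapped in, but the harness itself is hypothetical:

```python
import timeit
from html.parser import HTMLParser

# A synthetic document large enough for timings to be meaningful.
DOC = "<html><body>" + "<p itemprop='name'>text</p>" * 200 + "</body></html>"

class TagCounter(HTMLParser):
    """Minimal parser that just counts start tags."""
    def __init__(self):
        super().__init__()
        self.tags = 0

    def handle_starttag(self, tag, attrs):
        self.tags += 1

def parse_stdlib(doc):
    parser = TagCounter()
    parser.feed(doc)
    return parser.tags

def bench(parse, doc, number=50):
    """Seconds taken to run parse(doc) `number` times."""
    return timeit.timeit(lambda: parse(doc), number=number)

elapsed = bench(parse_stdlib, DOC)
```

Comparing `bench(lxml.html.fromstring, DOC)` against `bench(html5lib.parse, DOC)` on representative pages would put concrete numbers on the ~10x figure mentioned above.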

What are the comparative safety properties of bleach and bleach-extras compared to lxml's text cleaning functionality?

My understanding is that the goal of cleaning was not to ensure safety but to produce cleaner output, although I'm not 100% sure.

@jayaddison
Contributor Author

Thanks @lopuhin - it's good to have confirmation that performance could be a blocker for this.

There are also some dependencies which also require lxml, e.g. html-text and probably some others

Hopefully those other dependencies are no longer required; html-text can be dropped here (and I'll push some commits to do that shortly, just to demonstrate the possibility and finish up the path that was taken here). In practice, we'd want to figure out some of the other conversation around optional packages first. Speaking of which..

having it optional would be fine I think

Providing html5lib as an optional parser selection would be okay - in practice I tend to find the way that Python packaging deals with optional/substitute dependencies is less-than-ideal (and I worry that taking that route might cause other maintenance challenges in future). Perhaps with enough careful planning it'd be possible to manage the complexity there.
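For illustration, one common pattern for optional/substitute dependencies is to try imports in preference order at runtime. This is a sketch only - extruct doesn't currently work this way, and the function name is hypothetical:

```python
import importlib

def load_parser(preferred=("lxml.html", "html5lib", "html.parser")):
    """Return the first importable parser module from a preference list."""
    for name in preferred:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("no HTML parser available: tried %s" % (preferred,))

# Falls back gracefully when earlier choices are not installed.
parser_mod = load_parser(("nonexistent_parser_xyz", "html.parser"))
```

On the packaging side this would pair with an extra (e.g. installable as `extruct[html5lib]`); the maintenance concern mentioned above is that every such parser combination then needs testing.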

I did notice that there are already some abstractions over XML nodes/elements via the xmldom module (removed in this pull request currently). That could possibly handle most of the implementation details, but the dependency/packaging challenge would remain.

My understanding is that the goal of cleaning was not to ensure safety but to produce cleaner output, although I'm not 100% sure.

That makes sense; extracting text is likely the primary concern for most users. That said, changes to content handling in libraries have a strange way of affecting downstream applications, and developers for those would want to be aware of and prepared for a potentially risky change like this.

I think it'd be unlikely that one sanitization library would be conclusively 'better' or 'worse' in all situations: more likely that different situations would be handled by each one. Perhaps there are some good sources of test cases out there; I'll try to look into that soon.

@jayaddison jayaddison changed the title Idea: migrate from {lxml} to {html5lib, bleach} Idea: migrate from {lxml, html-text} to {html5lib, bleach} Dec 12, 2020
@jayaddison
Contributor Author

Closing this pull request for now; if and when there are significant performance improvements available to html5lib, it might be possible to revisit this.

@jayaddison jayaddison closed this Feb 9, 2021
@jayaddison jayaddison deleted the migrations/html5lib branch February 9, 2021 10:22
@jayaddison
Contributor Author

Although I'm not planning another attempt to migrate away from lxml at the moment, this seems like a relevant-ish place to mention that https://github.com/willforde/python-htmlement.git/ is an open-source Python package that provides an ElementTree interface over Python's built-in html.parser.

I haven't attempted to use it for anything other than basic JSON-LD XPath queries, so I don't have a lot of information to share about its compatibility and performance compared to other interfaces, but it does seem to me like a path worth exploring.
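Since htmlement produces standard xml.etree.ElementTree elements, JSON-LD lookups reduce to ElementTree XPath queries. The sketch below runs that kind of query against a well-formed snippet parsed with the stdlib directly (with htmlement one would call `htmlement.fromstring(html)` instead, which also tolerates malformed markup; the snippet itself is made up):

```python
import json
import xml.etree.ElementTree as ET

html = (
    "<html><head>"
    '<script type="application/ld+json">'
    '{"@type": "Recipe", "name": "Soup"}'
    "</script>"
    "</head><body></body></html>"
)

# ET.fromstring needs well-formed input; htmlement.fromstring would
# return an equivalent Element for messier real-world HTML.
root = ET.fromstring(html)
node = root.find(".//script[@type='application/ld+json']")
data = json.loads(node.text)
```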

@jayaddison
Contributor Author

See also: https://github.com/openculinary/extruct/tree/htmlement-parser for a work-in-progress branch using htmlement to provide parsing support throughout the extruct library.

Migrating the dublincore, json-ld and opengraph extractors appeared relatively straightforward -- microdata parsing is proving more challenging, largely around text handling and cleanup.

I'll also admit that this hasn't been tested on any datasets larger than the provided unit tests -- nor benchmarked -- so it's very much experimental. I'm optimistic about the parsing performance but don't yet have any data to provide.

@lopuhin
Member

lopuhin commented Oct 28, 2022

Thanks @jayaddison , that's really interesting, would love to hear about your progress 👍

@jayaddison
Contributor Author

Sure - glad to share a few findings.

Migration of the microdata parser has definitely been the main challenge - the path I've followed is to build a parallel HTMLementMicrodataExtractor alongside the existing LxmlMicrodataExtractor.

A few notes so far:

  • Extracting item properties and values in the correct order: after some naive attempts to match the existing LxmlMicrodataExtractor behaviour (which I imagine is fairly well optimized) and its test cases, it proved more straightforward to follow the algorithmic instructions in the whatwg spec directly. Optimization should be possible later, but getting a correct (or very-near-correct) solution first seems most important.
  • Cleaning of HTML text remains unsolved. Since the existing Cleaner utility expects an lxml-derived tree as input, it's not possible to provide it with ElementTree nodes. I haven't looked in detail at the lxml cleaning code yet, but going by the constructor signature, it looks like it performs quite a number of non-trivial tasks (in other words: likely more logic than should be implemented within the extruct library). It does occur to me that some of the script/style filtering may be near-identical to CDATA content filtering, so it's possible that there might be some shortcuts to achieve that efficiently using the Python parser -- but I don't have a clear enough understanding of the problem space yet to say for certain.
  • Of the test cases and data associated with extruct, the w3c/microdata.5.2.* cases are proving most challenging to handle at the moment. I seem to remember that was the case during the html5lib approach too. And it gives me the sense that there must be other edge cases as well.
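To make the first point concrete, the spec-style traversal amounts to walking children in document order, collecting itemprop values, and stopping at nested itemscope boundaries. A heavily simplified sketch (not extruct's implementation; nested items are just marked rather than recursed into):

```python
import xml.etree.ElementTree as ET

def collect_properties(item):
    """Collect the itemprop values of one microdata item in document order."""
    props = {}

    def walk(el):
        for child in el:
            name = child.get("itemprop")
            if name is not None:
                if "itemscope" in child.attrib:
                    props[name] = "<nested item>"  # real code would recurse
                else:
                    props[name] = "".join(child.itertext()).strip()
            if "itemscope" not in child.attrib:
                walk(child)  # don't descend into nested items

    walk(item)
    return props

doc = ET.fromstring(
    "<div itemscope=''>"
    "<span itemprop='name'>Tea</span>"
    "<div><span itemprop='description'>A hot drink</span></div>"
    "</div>"
)
props = collect_properties(doc)
```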

It's hard to guess when I might next make progress on this, but I'll continue to share any updates and additional findings.

@jayaddison
Contributor Author

@lopuhin I've managed to follow a couple of different paths to get either the microdata.5.2.* or the MicrodataWithDescription tests to pass -- but not both at the same time.

Most property extraction seems more-or-less solved, except for text data.

And from experimentation with various approaches to text retrieval, it looks like the problem basically comes down to tree transformation/traversal: emitting and combining the text tokens from the parse tree in an acceptable order.

In the case of malformed HTML (ambiguous tree representation), I have a feeling that the tree ordering emitted by htmlement could be a factor in the simplicity/complexity of text extraction. Whether ambiguous elements are placed into a nested hierarchy or instead into sibling elements, for example, could be important.

Either way I have a feeling that most of the work-in-progress in the branch is going to sit unchanged for a while longer. I'll probably next attempt a port of html_text to accept xml.etree.ElementTree objects.
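As a sketch of what such a port might start from, here is a naive text extractor over plain ElementTree nodes (html_text's real logic handles many more cases, e.g. block vs inline spacing and punctuation; this version even exhibits one such wart, a stray space before the "!"):

```python
import xml.etree.ElementTree as ET

SKIP_TAGS = {"script", "style"}

def element_text(el):
    """Naive text extraction: skip script/style, collapse whitespace."""
    parts = []

    def walk(node):
        if node.tag in SKIP_TAGS:
            return
        if node.text:
            parts.append(node.text)
        for child in node:
            walk(child)
            if child.tail:  # tail text belongs to the parent's flow
                parts.append(child.tail)

    walk(el)
    return " ".join(" ".join(parts).split())

root = ET.fromstring("<div> Hello <script>var x = 1;</script><b>world</b>! </div>")
text = element_text(root)  # note the stray space before "!"
```

Getting rid of that stray space - i.e. knowing when adjacent tokens should and shouldn't be joined with whitespace - is exactly the kind of ordering/spacing subtlety described above.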

(and slightly off-topic: I've also been wondering whether HTML parsing is strictly necessary to achieve W3 microdata format parsing, and whether it might be possible to write a grammar in parsimonious or similar that would accept the document as a str and identify objects within it. a very different approach, and some things like ID references would be challenging, but it feels like it might be possible)

@jayaddison
Contributor Author

I think I finally got close to the root (no pun intended) of the challenge with migrating text retrieval to htmlement.

Basically, it's that the HTML5 spec indicates different ways to generate implied closing element tags depending on the parsing context within the input HTML. It's good that this is now well specified. That said, htmlement is currently a fast and simple library and doesn't (and perhaps shouldn't) include logic for all of those rules; instead it uses a straightforward SAX-like parsing approach.

The text content retrieval test cases in extruct cover situations where tree order -- including implied closed tag placement in the tree -- is relevant, and so it's challenging to write a correct W3 microdata extractor without also having an HTML5-compliant parser.
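A small illustration of the gap, using the same stdlib parser htmlement builds on: for `<p>one<p>two</p>`, an HTML5-compliant tree builder would imply a `</p>` before the second `<p>` and produce sibling paragraphs, but the SAX-like event stream contains no such end-tag event, so a naive tree builder may nest the paragraphs instead:

```python
from html.parser import HTMLParser

class EventLog(HTMLParser):
    """Record the raw start/end/data events the parser emits."""
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

parser = EventLog()
parser.feed("<p>one<p>two</p>")
# No ("end", "p") event appears before the second ("start", "p"):
events = parser.events
```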

And I think that's where my work on this stops, for now :)

@lopuhin
Member

lopuhin commented Oct 31, 2022

Thank you for the update @jayaddison 👍

My 2c about some potentially related work on our side - having extruct work on a tree parsed by an HTML5-compliant parser is something we'd also want. So far for HTML parsing we've been using https://html5-parser.readthedocs.io/en/latest/, which uses the Gumbo parser under the hood and can emit an lxml.html.Element-based tree, so migrating from lxml.html is easier. In our use case, we'd already have a parsed HTML tree because we need it for other things as well.
