
Idea: migrate from {lxml, html-text} to {html5lib, bleach} #163

Closed
wants to merge 12 commits into from

Conversation

jayaddison
Contributor

This suggestion is derived from a specific use case where it'd be desirable to build pure-Python application containers for openculinary/crawler.

Currently, a dependency on lxml is an obstacle to this, since lxml binds to the libxml2 and libxslt C libraries, and those bindings must be provided and compiled at build time.

There are some practical issues to resolve:

  • Whitespace-related tests for W3C microdata are failing
  • Node-ordering-related tests for RDFa parsing are failing; this should be resolved once an upstream release of rdflib that includes the new dict-based store implementation (RDFLib/rdflib#1133) is available (in short, the default memory-backed RDF triple store was not guaranteed to preserve order)
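For context, the dict-based store change matters because Python dicts preserve insertion order (guaranteed since Python 3.7), which the older default store did not rely on. A minimal sketch of the property involved (illustrative only, not rdflib's actual implementation):

```python
# Python 3.7+ dicts iterate in insertion order, so a dict-keyed triple
# store naturally yields triples back in the order they were added.
triples = [
    ("s", "p", "apple"),
    ("s", "p", "banana"),
    ("s", "p", "cherry"),
]

store = {}  # stand-in for a dict-based triple store
for t in triples:
    store[t] = True

ordered = list(store)  # same order as insertion
```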

There are questions about the impact of the migration and whether it's worthwhile:

  • What is the performance impact of the migration, particularly for any customers using extruct for high volumes of data?
  • What are the comparative safety properties of bleach and bleach-extras compared to lxml's text cleaning functionality?

In short, it may not be worthwhile, but I spent a bit of time looking into the practicality of this and figured it was worth sharing even though it's a work in progress.

@lopuhin
Member

lopuhin commented Dec 12, 2020

Thanks for the PR @jayaddison, some notes:

There are also some dependencies that require lxml, e.g. html-text and probably some others - although html-text can be made optional, and maybe the syntaxes you care about are not affected.

What is the performance impact of the migration, particularly for any customers using extruct for high volumes of data?

From limited tests I did, html5lib is ~10 times slower than lxml.html (or other fast parsers), and parsing is a significant portion of extruct's runtime, so dropping lxml support would have too much of an impact on performance-sensitive users (e.g. we run extruct on high volumes of data, and probably others do as well). On the other hand, having it as an optional parser would be fine, I think - although I'm not sure how feasible that is, given that some of our dependencies also need lxml.
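For anyone wanting to reproduce that kind of comparison, a rough micro-benchmark harness might look like the sketch below. The stdlib `html.parser` is used as a stand-in parse function here; `html5lib.parse` or `lxml.html.fromstring` (both real entry points) could be swapped in, but the harness itself is hypothetical:

```python
import timeit
from html.parser import HTMLParser

# A synthetic document large enough for timings to be meaningful.
DOC = "<html><body>" + "<p itemprop='name'>text</p>" * 200 + "</body></html>"

class TagCounter(HTMLParser):
    """Minimal parser that just counts start tags."""
    def __init__(self):
        super().__init__()
        self.tags = 0

    def handle_starttag(self, tag, attrs):
        self.tags += 1

def parse_stdlib(doc):
    parser = TagCounter()
    parser.feed(doc)
    return parser.tags

def bench(parse, doc, number=50):
    """Seconds taken to run parse(doc) `number` times."""
    return timeit.timeit(lambda: parse(doc), number=number)

elapsed = bench(parse_stdlib, DOC)
```

Comparing `bench(lxml.html.fromstring, DOC)` against `bench(html5lib.parse, DOC)` on representative pages would put concrete numbers on the ~10x figure mentioned above.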

What are the comparative safety properties of bleach and bleach-extras compared to lxml's text cleaning functionality?

My understanding is that the goal of cleaning was not to ensure safety but to produce cleaner output, although I'm not 100% sure.

@jayaddison
Contributor Author

Thanks @lopuhin - it's good to have confirmation that performance could be a blocker for this.

There are also some dependencies which also require lxml, e.g. html-text and probably some others

Hopefully those other dependencies are no longer required; html-text can be dropped here (and I'll push some commits to do that shortly, just to demonstrate the possibility and finish up the path that was taken here). In practice, we'd want to figure out some of the other conversation around optional packages first. Speaking of which..

having it optional would be fine I think

Providing html5lib as an optional parser selection would be okay - in practice I tend to find the way that Python packaging deals with optional/substitute dependencies is less-than-ideal (and I worry that taking that route might cause other maintenance challenges in future). Perhaps with enough careful planning it'd be possible to manage the complexity there.
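For illustration, one common pattern for optional/substitute dependencies is to try imports in preference order at runtime. This is a sketch only - extruct doesn't currently work this way, and the function name is hypothetical:

```python
import importlib

def load_parser(preferred=("lxml.html", "html5lib", "html.parser")):
    """Return the first importable parser module from a preference list."""
    for name in preferred:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("no HTML parser available: tried %s" % (preferred,))

# Falls back gracefully when earlier choices are not installed.
parser_mod = load_parser(("nonexistent_parser_xyz", "html.parser"))
```

On the packaging side this would pair with an extra (e.g. installable as `extruct[html5lib]`); the maintenance concern mentioned above is that every such parser combination then needs testing.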

I did notice that there are already some abstractions over XML nodes/elements via the xmldom module (removed in this pull request currently). That could possibly handle most of the implementation details, but the dependency/packaging challenge would remain.

My understanding is that the goal of cleaning was not to ensure safety but to produce cleaner output, although I'm not 100% sure.

That makes sense; extracting text is likely the primary concern for most users. That said, changes to content handling in libraries have a strange way of affecting downstream applications, and developers for those would want to be aware of and prepared for a potentially risky change like this.

I think it'd be unlikely that one sanitization library would be conclusively 'better' or 'worse' in all situations: more likely that different situations would be handled by each one. Perhaps there are some good sources of test cases out there; I'll try to look into that soon.

@jayaddison jayaddison changed the title Idea: migrate from {lxml} to {html5lib, bleach} Idea: migrate from {lxml, html-text} to {html5lib, bleach} Dec 12, 2020
@jayaddison
Contributor Author

Closing this pull request for now; if and when there are significant performance improvements available to html5lib, it might be possible to revisit this.

@jayaddison jayaddison closed this Feb 9, 2021
@jayaddison jayaddison deleted the migrations/html5lib branch February 9, 2021 10:22
@jayaddison
Contributor Author

Although I'm not planning another attempt to migrate away from lxml at the moment, this seems like a relevant-ish place to mention that https://github.com/willforde/python-htmlement.git/ is an open-source Python package that provides an ElementTree interface over Python's built-in html.parser.

I haven't attempted to use it for anything other than basic JSON-LD XPath queries, so I don't have a lot of information to share about its compatibility and performance compared to other interfaces, but it does seem to me like a path worth exploring.
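Since htmlement produces standard xml.etree.ElementTree elements, JSON-LD lookups reduce to ElementTree XPath queries. The sketch below runs that kind of query against a well-formed snippet parsed with the stdlib directly (with htmlement one would call `htmlement.fromstring(html)` instead, which also tolerates malformed markup; the snippet itself is made up):

```python
import json
import xml.etree.ElementTree as ET

html = (
    "<html><head>"
    '<script type="application/ld+json">'
    '{"@type": "Recipe", "name": "Soup"}'
    "</script>"
    "</head><body></body></html>"
)

# ET.fromstring needs well-formed input; htmlement.fromstring would
# return an equivalent Element for messier real-world HTML.
root = ET.fromstring(html)
node = root.find(".//script[@type='application/ld+json']")
data = json.loads(node.text)
```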

@jayaddison
Contributor Author

See also: https://github.com/openculinary/extruct/tree/htmlement-parser for a work-in-progress branch using htmlement to provide parsing support throughout the extruct library.

Migrating the dublincore, json-ld and opengraph extractors appeared relatively straightforward -- microdata parsing is proving more challenging, largely around text handling and cleanup.

I'll also admit that this hasn't been tested on any datasets larger than the provided unit tests -- nor benchmarked -- so it's very much experimental. I'm optimistic about the parsing performance but don't yet have any data to provide.

@lopuhin
Member

lopuhin commented Oct 28, 2022

Thanks @jayaddison , that's really interesting, would love to hear about your progress 👍

@jayaddison
Contributor Author

Sure - glad to share a few findings.

Migration of the microdata parser has definitely been the main challenge - the path I've followed is to build a parallel HTMLementMicrodataExtractor alongside the existing LxmlMicrodataExtractor.

A few notes so far:

  • Extracting item properties and values in the correct order: after some naive attempts to match the existing LxmlMicrodataExtractor behaviour (which I imagine is fairly well optimized) and its test cases, it proved more straightforward to follow the algorithmic instructions in the whatwg spec directly. Optimization should be possible later, but getting a correct (or very-near-correct) solution first seems most important.
  • Cleaning of HTML text remains unsolved. Since the existing Cleaner utility expects an lxml-derived tree as input, it's not possible to provide it with ElementTree nodes. I haven't looked in detail at the lxml cleaning code yet, but going by the constructor signature, it looks like it performs quite a number of non-trivial tasks (in other words: likely more logic than should be implemented within the extruct library). It does occur to me that some of the script/style filtering may be near-identical to CDATA content filtering, so it's possible that there might be some shortcuts to achieve that efficiently using the Python parser -- but I don't have a clear enough understanding of the problem space yet to say for certain.
  • Of the test cases and data associated with extruct, the w3c/microdata.5.2.* cases are proving most challenging to handle at the moment. I seem to remember that was the case during the html5lib approach too. And it gives me the sense that there must be other edge cases as well.
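To make the first point concrete, the spec-style traversal amounts to walking children in document order, collecting itemprop values, and stopping at nested itemscope boundaries. A heavily simplified sketch (not extruct's implementation; nested items are just marked rather than recursed into):

```python
import xml.etree.ElementTree as ET

def collect_properties(item):
    """Collect the itemprop values of one microdata item in document order."""
    props = {}

    def walk(el):
        for child in el:
            name = child.get("itemprop")
            if name is not None:
                if "itemscope" in child.attrib:
                    props[name] = "<nested item>"  # real code would recurse
                else:
                    props[name] = "".join(child.itertext()).strip()
            if "itemscope" not in child.attrib:
                walk(child)  # don't descend into nested items

    walk(item)
    return props

doc = ET.fromstring(
    "<div itemscope=''>"
    "<span itemprop='name'>Tea</span>"
    "<div><span itemprop='description'>A hot drink</span></div>"
    "</div>"
)
props = collect_properties(doc)
```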

It's hard to guess when I might next make progress on this, but I'll continue to share any updates and additional findings.

@jayaddison
Contributor Author

@lopuhin I've managed to follow a couple of different paths to get either the microdata.5.2.* or the MicrodataWithDescription tests to pass -- but not both at the same time.

Most property extraction seems more-or-less solved, except for text data.

And from experimentation with various approaches to text retrieval, it looks like the problem basically comes down to tree transformation/traversal: emitting and combining the text tokens from the parse tree in an acceptable order.

In the case of malformed HTML (ambiguous tree representation), I have a feeling that the tree ordering emitted by htmlement could be a factor in the simplicity/complexity of text extraction. Whether ambiguous elements are placed into a nested hierarchy or instead into sibling elements, for example, could be important.

Either way I have a feeling that most of the work-in-progress in the branch is going to sit unchanged for a while longer. I'll probably next attempt a port of html_text to accept xml.etree.ElementTree objects.
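As a sketch of what such a port might start from, here is a naive text extractor over plain ElementTree nodes (html_text's real logic handles many more cases, e.g. block vs inline spacing and punctuation; this version even exhibits one such wart, a stray space before the "!"):

```python
import xml.etree.ElementTree as ET

SKIP_TAGS = {"script", "style"}

def element_text(el):
    """Naive text extraction: skip script/style, collapse whitespace."""
    parts = []

    def walk(node):
        if node.tag in SKIP_TAGS:
            return
        if node.text:
            parts.append(node.text)
        for child in node:
            walk(child)
            if child.tail:  # tail text belongs to the parent's flow
                parts.append(child.tail)

    walk(el)
    return " ".join(" ".join(parts).split())

root = ET.fromstring("<div> Hello <script>var x = 1;</script><b>world</b>! </div>")
text = element_text(root)  # note the stray space before "!"
```

Getting rid of that stray space - i.e. knowing when adjacent tokens should and shouldn't be joined with whitespace - is exactly the kind of ordering/spacing subtlety described above.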

(and slightly off-topic: I've also been wondering whether HTML parsing is strictly necessary to achieve W3 microdata format parsing, and whether it might be possible to write a grammar in parsimonious or similar that would accept the document as a str and identify objects within it. a very different approach, and some things like ID references would be challenging, but it feels like it might be possible)

@jayaddison
Contributor Author

I think I finally got close to the root (no pun intended) of the challenge with migrating text retrieval to htmlement.

Basically, it's that the HTML5 spec indicates different ways to generate implied closing element tags depending on the parsing context within the input HTML. It's good that this is now well specified. That said, htmlement is currently a fast and simple library and doesn't (and perhaps shouldn't) include logic for all of those rules; instead it uses a straightforward SAX-like parsing approach.

The text content retrieval test cases in extruct cover situations where tree order -- including implied closed tag placement in the tree -- is relevant, and so it's challenging to write a correct W3 microdata extractor without also having an HTML5-compliant parser.
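A small illustration of the gap, using the same stdlib parser htmlement builds on: for `<p>one<p>two</p>`, an HTML5-compliant tree builder would imply a `</p>` before the second `<p>` and produce sibling paragraphs, but the SAX-like event stream contains no such end-tag event, so a naive tree builder may nest the paragraphs instead:

```python
from html.parser import HTMLParser

class EventLog(HTMLParser):
    """Record the raw start/end/data events the parser emits."""
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

parser = EventLog()
parser.feed("<p>one<p>two</p>")
# No ("end", "p") event appears before the second ("start", "p"):
events = parser.events
```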

And I think that's where my work on this stops, for now :)

@lopuhin
Member

lopuhin commented Oct 31, 2022

Thank you for the update @jayaddison 👍

My 2c about some potentially related work on our side - having extruct work on a tree parsed by an HTML5-compliant parser is something we'd also want. So far for HTML parsing we've been using https://html5-parser.readthedocs.io/en/latest/, which uses the Gumbo parser under the hood and can emit an lxml.html.Element-based tree, so migrating from lxml.html is easier. In our use case, we'd already have a parsed HTML tree because we need it for other things as well.
