Fixed blog readtime calculation to ignore non-content text #7370

sisp · 2024-07-16T13:19:55Z

I've fixed the blog readtime calculation to ignore non-content text, i.e. any text nodes that are inside the following tags are omitted from calculating the readtime:

<object>
<script>
<style>
<svg>

The implementation is inspired by the search plugin parser.

Note that I've duplicated the void set from the search plugin. It might be cleaner to deduplicate it, but I wasn't sure where to best move it to import from. If you'd prefer deduplication, I'd appreciate your guidance on where to factor void out, @squidfunk.

Fixes #7367.

squidfunk · 2024-07-16T13:23:08Z

Note that I've duplicated the void set from the search plugin. It might be cleaner to deduplicate it, but I wasn't sure where to best move it to import from. If you'd prefer deduplication, I'd appreciate your guidance on where to factor void out, @squidfunk.

There's no need to integrate the void tags at all, as for them, the data handler is not called. Void tags are tags without children, i.e., leaf components. Please remove the void handling logic, as it does not influence the result.

squidfunk · 2024-07-16T13:24:34Z

Ah wait, no, we might need them to maintain the context. Can you please just import it from the search plugin for now and make a note in the code that we might refactor it at some point? Please don't do the refactoring now, we're heavily working behind the curtains right now and there will be fundamental changes to Material for MkDocs, which is why we currently try to keep things as stable and unchanging as possible.

squidfunk · 2024-07-16T13:25:23Z

... you might check if the _endtag handler is called for void tags. If it is, remove the void logic. If not, keep it. It's been too long since I've worked with Python's HTML SAX parser.

sisp · 2024-07-16T13:43:39Z

I've checked: The _endtag is not called for void tags. Now, the void set is imported from the search plugin as you suggested.

squidfunk · 2024-07-16T13:48:03Z

Okay great, thanks for investigating. Note that this is an HTML4 parser, not HTML5, which we chose for speed and package size (as no further dependency is needed), so as I remember, this is why we need to check for void tags by ourselves. We'll check in the future if we can swap this out for an approach that directly operates on the HTML AST, which would be much more efficient, but for now, I think we can keep it.

squidfunk · 2024-07-16T13:49:09Z

From what I can see, this looks great, ready to merge! Thanks for your time!

sisp · 2024-07-16T14:28:26Z

Perfect, thanks for the helpful review and feedback! 🙏

Fixed blog readtime calculation to ignore non-content text

4c9c02f

sisp force-pushed the fix/blog-readtime branch from 22054eb to 4c9c02f Compare July 16, 2024 13:42

squidfunk merged commit 6b13c56 into squidfunk:master Jul 16, 2024
4 checks passed

sisp deleted the fix/blog-readtime branch July 16, 2024 14:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixed blog readtime calculation to ignore non-content text #7370

Fixed blog readtime calculation to ignore non-content text #7370

sisp commented Jul 16, 2024

squidfunk commented Jul 16, 2024

squidfunk commented Jul 16, 2024

squidfunk commented Jul 16, 2024

sisp commented Jul 16, 2024

squidfunk commented Jul 16, 2024

squidfunk commented Jul 16, 2024

sisp commented Jul 16, 2024

Fixed blog readtime calculation to ignore non-content text #7370

Fixed blog readtime calculation to ignore non-content text #7370

Conversation

sisp commented Jul 16, 2024

squidfunk commented Jul 16, 2024

squidfunk commented Jul 16, 2024

squidfunk commented Jul 16, 2024

sisp commented Jul 16, 2024

squidfunk commented Jul 16, 2024

squidfunk commented Jul 16, 2024

sisp commented Jul 16, 2024