Process inner html of blocks when escaping text content #719

jffng · 2024-09-09T18:48:01Z

What
This PR enables HTML markup inside block content to be processed while still allowing the text content to be safely escaped.

Before: Kapture.2024-09-11.at.15.22.11.mp4
After: 719.after_reduced.mp4

How
Updates the CBT_Theme_Locale::escape_text_content function to handle nested HTML markup.

Uses WP_HTML_Tag_Processor to process every token of the block's content and generates a string that can be formatted and then translated.
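
As a rough illustration of the idea (not the code in this PR), the sketch below uses plain PHP with a regular expression standing in for WP_HTML_Tag_Processor, so it runs without WordPress loaded. The function name sketch_split_inner_html is hypothetical. It walks the tag and text tokens, swaps each tag for a numbered placeholder, and returns the format string together with the list of tags.

```php
<?php
/**
 * Hypothetical helper, for illustration only: a regex stands in for
 * WP_HTML_Tag_Processor so the sketch runs without WordPress loaded.
 *
 * @param string $html Inner HTML of a block, e.g. 'Hello <em>world</em>'.
 * @return array{text: string, tokens: string[]}
 */
function sketch_split_inner_html( string $html ): array {
	// Split into alternating text and tag tokens, keeping the tags.
	$parts  = preg_split( '/(<[^>]+>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY );
	$text   = '';
	$tokens = array();

	foreach ( $parts as $part ) {
		if ( '<' === $part[0] && '>' === substr( $part, -1 ) ) {
			// A tag: store it and leave a numbered placeholder behind.
			$tokens[] = $part;
			$text    .= '%' . count( $tokens ) . '$s';
		} else {
			// Plain text: this is the part a translator would see.
			$text .= $part;
		}
	}

	return array(
		'text'   => $text,
		'tokens' => $tokens,
	);
}

$result = sketch_split_inner_html( 'Visit <a href="https://wordpress.org">WordPress</a> today' );
// Feeding the tags back through sprintf() restores the original markup.
echo sprintf( $result['text'], ...$result['tokens'] ), "\n";
```

In the generated pattern, the format string would be the argument to the translation function and the tags would be the remaining sprintf() arguments. Text that contains a literal % needs extra care before it is used as a format string.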

Why
Fixes #573, #682, and #691.

To test

Add some markup to the inside of a block in a template.
Save the changes to the theme.
Verify the template has been patternized correctly, and the template appears as expected.

jffng · 2024-09-11T21:18:48Z

includes/create-theme/theme-locale.php

+					$tokens[] = "</{$token_name}>";
+				} else {
+					// Depending on the HTML tag, we may need to process attributes so they are correctly added to the placeholder.
+					switch ( $token_name ) {


One thing I would like feedback on is whether this is too fragile, complicated, or tedious an approach to maintain.

I plan to add a few more unit tests.

mikachan · 2024-09-12T15:33:33Z

I think we're good to rely on the output of get_token_name(), so I believe this is a fine way to parse these tags. The scope is limited, since we currently only deal with a, img, and mark, and otherwise default to the original tag.

That's only my opinion though 😅 Was there anything you were specifically concerned about with this approach?

jffng · 2024-09-12

My concern is just that we need to add cases for specific tags, and that it may break if the attributes change.

But the options for adding inline HTML / formatting inside blocks are rather limited, so maybe it's okay.

creativecoder · 2024-09-12T19:20:54Z

Do we have to deconstruct the attributes and reassemble them? I'm worried about dropping attributes or other parts of the markup that might be added by plugins or filters.

For example, could we replace <a href="https://wordpress.org">WordPress</a> with something like

<a href="<?php echo esc_url( 'https://wordpress.org' ); ?>"><?php echo esc_html__( 'WordPress' ); ?></a>

but maintain any other attributes added to the <a> tag?

I'm not really worried about escaping every possible attribute, I think the main thing is to make sure the translated strings are escaped as a form of user input.

jffng · 2024-09-13

> I'm worried about dropping attributes or other parts of the markup that might be added by plugins or filters.

That's a good point. I could not find a way with the HTML Tag Processor to just carry over the whole tag token and all its attributes. But I refactored the approach so that all the attributes should be carried over, rather than processing the attributes based on the type of tag: 678a8ef#diff-fc11ad66397ed68725a8fe74f3031f104368f850b5b94370123482ec9e719e1bR54-R70
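
To make the "carry over every attribute" idea concrete, here is a minimal sketch in plain PHP. It assumes the tag name and attributes have already been read from the parser into an array, and it assumes only href and src hold URLs. The function name sketch_tag_to_php_expression is hypothetical and this is not the code in 678a8ef. It builds the PHP expression that would be passed to sprintf() as a placeholder argument.

```php
<?php
/**
 * Hypothetical helper: turn a tag name and its attributes into the PHP
 * expression used as a sprintf() argument in the generated pattern.
 * URL attributes are wrapped in an esc_url() call; every other attribute
 * is copied through unchanged so nothing is dropped.
 *
 * @param string               $tag_name   Lowercase tag name, e.g. 'a'.
 * @param array<string,string> $attributes Attribute name => value pairs.
 */
function sketch_tag_to_php_expression( string $tag_name, array $attributes ): string {
	$url_attributes = array( 'href', 'src' ); // Assumption: only these hold URLs.
	$expression     = "'<" . $tag_name;

	foreach ( $attributes as $name => $value ) {
		// Escape backslashes and single quotes for a single-quoted PHP literal.
		$value = str_replace( array( '\\', "'" ), array( '\\\\', "\\'" ), $value );
		if ( in_array( $name, $url_attributes, true ) ) {
			$expression .= ' ' . $name . "=\"' . esc_url( '" . $value . "' ) . '\"";
		} else {
			$expression .= ' ' . $name . '="' . $value . '"';
		}
	}

	return $expression . ">'";
}

echo sketch_tag_to_php_expression(
	'a',
	array(
		'href'      => 'https://wordpress.org',
		'data-type' => 'link',
	)
), "\n";
```

Because the loop iterates over whatever attributes are present, attributes added by plugins or filters would pass through without a per-tag case.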

madhusudhand

It works! However, I ran into a scenario where multiple individually translated strings now become a single translation.

I guess this is because the template comes from the editor as a single string (after translation), so it is now treated as one string.

I don't think this should block the change. It could be addressed separately, perhaps by reading the original pattern and the current template together.

Original

<h2 class="wp-block-heading" id="botswana-new-zealand-south-korea-japan-madagascar" style="font-size:80px;font-style:normal;font-weight:900;line-height:1.1;text-transform:uppercase"><a href=""><?php echo esc_html__( 'ITALY', 'adventurer' ); ?></a><br><a href=""><?php echo esc_html__( 'COSTA RICA', 'adventurer' ); ?></a><br><a href=""><?php echo esc_html__( 'CANADA', 'adventurer' ); ?></a><br><a href=""><?php echo esc_html__( 'LAOS', 'adventurer' ); ?></a><br><a href=""><?php echo esc_html__( 'TURKEY', 'adventurer' ); ?></a></h2>

After CBT localization

<h2 class="wp-block-heading" id="botswana-new-zealand-south-korea-japan-madagascar" style="font-size:80px;font-style:normal;font-weight:900;line-height:1.1;text-transform:uppercase"><?php /* Translators: %s are html tags */ echo sprintf( esc_html__( '%sITALY%s%s%sCOSTA RICA%s%s%sCANADA%s%s%sLAOS%s%s%sTURKEY%s', 'adventurer' ), '<a href="">', '</a>', '<br>', '<a href="">', '</a>', '<br>', '<a href="">', '</a>', '<br>', '<a href="">', '</a>', '<br>', '<a href="">', '</a>' ); ?></h2>

mikachan

Thanks so much for handling this @jffng! I can see this will fix the issues we've been seeing with translations ✨

@creativecoder, if you get a chance it would be great to hear your thoughts on this.

creativecoder

Thanks for trying this; it looks like a very good start!

jffng · 2024-09-23T17:32:27Z

I think I addressed all the feedback:

Changed approach so we don't drop any attributes
Refactored to its own class.
Handled case when text contains % and added a test for it

Ready for another review.
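
For context on the % case: sprintf() treats % as the start of a conversion specification, so a literal percent sign in the text has to be doubled before the text becomes part of a format string. A minimal sketch, with a hypothetical helper name (the PR's actual implementation may differ):

```php
<?php
// Hypothetical helper name; the PR's actual implementation may differ.
function sketch_escape_percent( string $text ): string {
	// Double every literal percent sign so sprintf() prints it as-is.
	return str_replace( '%', '%%', $text );
}

// Escape the text first, then append the tag placeholders.
$format = sketch_escape_percent( 'Save 50% on ' ) . '%1$sthemes%2$s';
echo sprintf( $format, '<em>', '</em>' ), "\n";
```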

mikachan

Thanks so much @jffng! I've tested the latest changes out and I believe everything is working as described. Great job 👏

Here are my test results, using the Adventurer theme as my test theme, and editing the Blog Home template (which uses a list of headings with links):

trunk	This PR

creativecoder

This is working well at solving the problem in my testing! 🎉

Here's the block markup I used:

<!-- wp:paragraph {"align":"center"} -->
<p class="has-text-align-center"><strong>Études</strong> is a pioneering firm that seamlessly merges creativity<br>and functionality to <a href="https://wordpress.org" data-type="link" data-id="https://wordpress.org">redefine <em>architectural</em> excellence</a>.</p>
<!-- /wp:paragraph -->

And here's the output in the .php pattern file:

<?php /* Translators: %1$s is the start of a 'strong' HTML element, %2$s is the end of a 'strong' HTML element, %3$s is the start of a 'br' HTML element, %4$s is the start of a 'a' HTML element, %5$s is the start of a 'em' HTML element, %6$s is the end of a 'em' HTML element, %7$s is the end of a 'a' HTML element */  echo sprintf( esc_html__( '%1$sÉtudes%2$s is a pioneering firm that seamlessly merges creativity%3$sand functionality to %4$sredefine %5$sarchitectural%6$s excellence%7$s.', 'twentytwentyfour' ), '<strong>', '</strong>', '<br>', '<a href="' . esc_url( 'https://wordpress.org' ) . '" data-type="link" data-id="https://wordpress.org">', '<em>', '</em>', '</a>' ); ?>

A few changes I think we need to make this output correct:

I believe the /* translators: comment needs to be on a separate line, immediately above the call to the translation function (but I'm not 100% sure)
When there are multiple placeholders in a string, the comment should be numbered, e.g. /* Translators: 1: start of 'strong' HTML element, 2: ... (see plugin docs)
If there's a way to differentiate between tags that are self closing (like <br>), it would be nice to modify the comment (1: 'br' HTML element, 2: ...), so it's not confusing when there's no corresponding closing tag
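
A minimal sketch of what the first two suggestions could look like, in plain PHP. The helper name sketch_translators_comment and the text domain 'example-theme' are placeholders, and the snippet only prints the pattern code it would generate rather than executing it.

```php
<?php
/**
 * Hypothetical helper: build a numbered translators comment from a list
 * of placeholder descriptions, using the "1: ..., 2: ..." convention.
 *
 * @param string[] $descriptions One description per placeholder, in order.
 */
function sketch_translators_comment( array $descriptions ): string {
	$parts = array();
	foreach ( array_values( $descriptions ) as $index => $description ) {
		$parts[] = ( $index + 1 ) . ': ' . $description;
	}
	return '/* translators: ' . implode( ', ', $parts ) . '. */';
}

$comment = sketch_translators_comment(
	array(
		"start of 'strong' HTML element",
		"end of 'strong' HTML element",
	)
);

// Put the comment on its own line, directly above the translation call.
$lines = array(
	'<?php',
	$comment,
	"echo sprintf( esc_html__( '%1\$sÉtudes%2\$s', 'example-theme' ), '<strong>', '</strong>' );",
	'?>',
);
echo implode( "\n", $lines ), "\n";
```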

creativecoder · 2024-09-26T02:46:16Z

includes/create-theme/theme-locale.php

+		$translators_note = $p->get_translators_note();
+
+		if ( ! empty( $tokens ) ) {
+			return "<?php $translators_note echo sprintf( esc_html__( '$text', '" . wp_get_theme()->get( 'TextDomain' ) . "' ), " . implode(


I'm not certain, but I suspect the translators comment may need to be on a separate line, immediately above the esc_html__ function, to be parsed correctly.

creativecoder · 2024-09-26T02:49:06Z

includes/create-theme/theme-token-processor.php

+
+					$token                  .= '>';
+					$this->tokens[]          = $token;
+					$this->translators_note .= '%' . $this->increment . "\$s is the start of a '" . $token_name . "' HTML element";


Do we have any way of differentiating self closing tags? For example "%3$s is the start of a 'br' HTML element" might be a little confusing, since it doesn't (usually) have a closing tag.
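
One possible way to tell these tags apart, sketched in plain PHP: compare the tag name against the list of void elements from the HTML standard. The constant and function names below are hypothetical, and the wording of the note is only a suggestion.

```php
<?php
// Void elements per the HTML standard: they never have a closing tag.
const SKETCH_VOID_ELEMENTS = array( 'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'link', 'meta', 'source', 'track', 'wbr' );

/**
 * Hypothetical helper: describe an opening tag for the translators note.
 * Void elements get no "start of" wording, since no closing tag follows.
 */
function sketch_describe_opening_tag( string $token_name ): string {
	// Normalize case so 'BR' and 'br' are treated the same.
	$name = strtolower( $token_name );
	if ( in_array( $name, SKETCH_VOID_ELEMENTS, true ) ) {
		return "'" . $name . "' HTML element";
	}
	return "start of '" . $name . "' HTML element";
}

echo sketch_describe_opening_tag( 'BR' ), "\n";
echo sketch_describe_opening_tag( 'A' ), "\n";
```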

jffng added 4 commits September 9, 2024 14:39

WIP working to convert attributes and URLs.

5724b49

Ensure media is added to local.

723a88f

Use a list of tokens and sprintf to generate the formatted string.

a9ca35a

Only format the string if tokens are present.

0460dc5

jffng force-pushed the fix/nested-html-translation-string-conversion branch from bdf39dd to 9f43b73 Compare September 10, 2024 21:52

Update content with html test.

194764c

Update remaining tests with inner markup. Try and fix tests. Format tests.

jffng force-pushed the fix/nested-html-translation-string-conversion branch from 807653d to 194764c Compare September 11, 2024 19:08

Remove whitespace.

45c312c

jffng marked this pull request as ready for review September 11, 2024 19:35

This was linked to issues Sep 11, 2024

HTML inside text blocks renders as text. #682

Open

Bug: Something isn't right when selecting all options in Save Changes to Theme #691

Open

jffng self-assigned this Sep 11, 2024

jffng changed the title from "Process inner html of blocks when escaping" to "Process inner html of blocks when escaping text content" Sep 11, 2024

jffng commented Sep 11, 2024

madhusudhand reviewed Sep 12, 2024

mikachan reviewed Sep 12, 2024

creativecoder reviewed Sep 12, 2024

madhusudhand reviewed Sep 13, 2024

jffng added 5 commits September 13, 2024 13:01

Process all attributes

678a8ef

Escape at the end. Provide better translation note.

Refactor token processing to its own class.

bf8ff55

Update tests with string replacements and translation.

d55f322

Check if % exists in the text and escape it.

3f0adfa

Add a test case for a localizing text that includes a %.

4504111

jffng requested review from mikachan, madhusudhand and creativecoder September 23, 2024 17:30

mikachan mentioned this pull request Sep 24, 2024

after "Save Changes to Theme" links in footer are parsed to text #731

Closed

mikachan approved these changes Sep 24, 2024

creativecoder reviewed Sep 26, 2024
