Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Process inner html of blocks when escaping text content #719

Open
wants to merge 11 commits into
base: trunk
Choose a base branch
from

Conversation

jffng
Copy link
Contributor

@jffng jffng commented Sep 9, 2024

What
This PR enables HTML markup inside block content to be processed while still allowing the text content to be safely escaped.

Before After
Kapture.2024-09-11.at.15.22.11.mp4
719.after_reduced.mp4

How
Updates the CBT_Theme_Locale::escape_text_content function to handle nested HTML markup.

Uses WP_HTML_Tag_Processor to process every token of the blocks content, and generates a string that can be formatted and then translated.

Why
Fixes #573 #682 and #691.

To test

  • Add some markup to the inside of a block in a template.
  • Save the changes to the theme.
  • Verify the template has been patternized correctly, and the template appears as expected.

@jffng jffng force-pushed the fix/nested-html-translation-string-conversion branch from bdf39dd to 9f43b73 Compare September 10, 2024 21:52
Update remaining tests with inner markup.

Try and fix tests.

Format tests.
@jffng jffng force-pushed the fix/nested-html-translation-string-conversion branch from 807653d to 194764c Compare September 11, 2024 19:08
@jffng jffng marked this pull request as ready for review September 11, 2024 19:35
@jffng jffng self-assigned this Sep 11, 2024
@jffng jffng changed the title Process inner html of blocks when escaping Process inner html of blocks when escaping text content Sep 11, 2024
$tokens[] = "</{$token_name}>";
} else {
// Depending on the HTML tag, we may need to process attributes so they are correctly added to the placeholder.
switch ( $token_name ) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I would like feedback on is whether this is too fragile, complicated, or tedious an approach to maintain.

I plan to add a few more unit tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're good to rely on the output of get_token_name(), so based on that, I believe this is a fine way to parse these tags. We have a limited scope since we're currently only dealing with a, img and mark, and otherwise we default to the original tag.

That's only my opinion though 😅 Was there anything you were specifically concerned about with this approach?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think just needing to add cases for specific tags and if the attributes changes, it may break.

But the options for adding inline HTML / formatting inside blocks are rather limited, so maybe it's okay.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have to deconstruct the attributes and reassemble them? I'm worried about dropping attributes or other parts of the markup that might be added by plugins or filters.

For example, could we replace <a href="https://wordpress.org">WordPress</a> with something like

<a href="<?php echo esc_url( 'https://wordpress.org' ) >?"><?php echo esc_html__( 'WordPress' ) ?></a>

but maintain any other attributes added to the <a> tag?

I'm not really worried about escaping every possible attribute, I think the main thing is to make sure the translated strings are escaped as a form of user input.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried about dropping attributes or other parts of the markup that might be added by plugins or filters.

That's a good point. I could not find a way with the HTML Tag Processor to just carry over the whole tag token and all its attributes. But I refactored the approach so that all the attributes should be carried over, rather than processing the attributes based on the type of tag: 678a8ef#diff-fc11ad66397ed68725a8fe74f3031f104368f850b5b94370123482ec9e719e1bR54-R70

Copy link
Contributor

@madhusudhand madhusudhand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works! But I encountered following scenario where multiple individually translated strings now becomes the single translation.

I guess this is because the template in the editor comes with a single string (after translations) and now it is seen as a single string.

I don't think this should be blocking this change. It is to be addressed separately by reading original pattern and current template together may be?

Original

<h2 class="wp-block-heading" id="botswana-new-zealand-south-korea-japan-madagascar" style="font-size:80px;font-style:normal;font-weight:900;line-height:1.1;text-transform:uppercase"><a href=""><?php echo esc_html__( 'ITALY', 'adventurer' ); ?></a><br><a href=""><?php echo esc_html__( 'COSTA RICA', 'adventurer' ); ?></a><br><a href=""><?php echo esc_html__( 'CANADA', 'adventurer' ); ?></a><br><a href=""><?php echo esc_html__( 'LAOS', 'adventurer' ); ?></a><br><a href=""><?php echo esc_html__( 'TURKEY', 'adventurer' ); ?></a></h2>

After CBT localization

<h2 class="wp-block-heading" id="botswana-new-zealand-south-korea-japan-madagascar" style="font-size:80px;font-style:normal;font-weight:900;line-height:1.1;text-transform:uppercase"><?php /* Translators: %s are html tags */ echo sprintf( esc_html__( '%sITALY%s%s%sCOSTA RICA%s%s%sCANADA%s%s%sLAOS%s%s%sTURKEY%s', 'adventurer' ), '<a href="">', '</a>', '<br>', '<a href="">', '</a>', '<br>', '<a href="">', '</a>', '<br>', '<a href="">', '</a>', '<br>', '<a href="">', '</a>' ); ?></h2>

Copy link
Member

@mikachan mikachan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for handling this @jffng! I can see this will fix the issues we've been seeing with translations ✨

@creativecoder, if you get a chance it would be great to hear your thoughts on this.

$tokens[] = "</{$token_name}>";
} else {
// Depending on the HTML tag, we may need to process attributes so they are correctly added to the placeholder.
switch ( $token_name ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're good to rely on the output of get_token_name(), so based on that, I believe this is a fine way to parse these tags. We have a limited scope since we're currently only dealing with a, img and mark, and otherwise we default to the original tag.

That's only my opinion though 😅 Was there anything you were specifically concerned about with this approach?

Copy link
Contributor

@creativecoder creativecoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for trying this; it looks like a very good start!

includes/create-theme/theme-locale.php Show resolved Hide resolved
includes/create-theme/theme-locale.php Outdated Show resolved Hide resolved
$tokens[] = "</{$token_name}>";
} else {
// Depending on the HTML tag, we may need to process attributes so they are correctly added to the placeholder.
switch ( $token_name ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have to deconstruct the attributes and reassemble them? I'm worried about dropping attributes or other parts of the markup that might be added by plugins or filters.

For example, could we replace <a href="https://wordpress.org">WordPress</a> with something like

<a href="<?php echo esc_url( 'https://wordpress.org' ) >?"><?php echo esc_html__( 'WordPress' ) ?></a>

but maintain any other attributes added to the <a> tag?

I'm not really worried about escaping every possible attribute, I think the main thing is to make sure the translated strings are escaped as a form of user input.

includes/create-theme/theme-locale.php Outdated Show resolved Hide resolved
includes/create-theme/theme-locale.php Show resolved Hide resolved
@jffng
Copy link
Contributor Author

jffng commented Sep 23, 2024

I think I addressed all the feedback:

  • Changed approach so we don't drop any attributes
  • Refactored to its own class.
  • Handled case when text contains % and added a test for it

Ready for another review.

Copy link
Member

@mikachan mikachan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much @jffng! I've tested the latest changes out and I believe everything is working as described. Great job 👏

Here are my test results, using the Adventurer theme as my test theme, and editing the Blog Home template (which uses a list of headings with links):

trunk This PR
image image

Copy link
Contributor

@creativecoder creativecoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is working well at solving the problem in my testing! 🎉

Here's the block markup I used:

<!-- wp:paragraph {"align":"center"} -->
<p class="has-text-align-center"><strong>Études</strong> is a pioneering firm that seamlessly merges creativity<br>and functionality to <a href="https://wordpress.org" data-type="link" data-id="https://wordpress.org">redefine <em>architectural</em> excellence</a>.</p>
<!-- /wp:paragraph -->

And here's the output in the .php pattern file:

<?php /* Translators: %1$s is the start of a 'strong' HTML element, %2$s is the end of a 'strong' HTML element, %3$s is the start of a 'br' HTML element, %4$s is the start of a 'a' HTML element, %5$s is the start of a 'em' HTML element, %6$s is the end of a 'em' HTML element, %7$s is the end of a 'a' HTML element */  echo sprintf( esc_html__( '%1$sÉtudes%2$s is a pioneering firm that seamlessly merges creativity%3$sand functionality to %4$sredefine %5$sarchitectural%6$s excellence%7$s.', 'twentytwentyfour' ), '<strong>', '</strong>', '<br>', '<a href="' . esc_url( 'https://wordpress.org' ) . '" data-type="link" data-id="https://wordpress.org">', '<em>', '</em>', '</a>' ); ?>

A few changes I think we need to make this output correct

  • I believe the /* translators: comment needs to be on a separate line, immediately above the call to the translation function (but I'm not 100% sure)
  • When there are multiple placeholders in a string, the comment should be numbered, e.g. /* Translators: 1: start of 'strong' HTML element, 2: ... (see plugin docs)
  • If there's a way to differentiate between tags that are self closing (like <br>), it would be nice to modify the comment (1: 'br' HTML element, 2: ...), so it's not confusing when there's no corresponding closing tag

$translators_note = $p->get_translators_note();

if ( ! empty( $tokens ) ) {
return "<?php $translators_note echo sprintf( esc_html__( '$text', '" . wp_get_theme()->get( 'TextDomain' ) . "' ), " . implode(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not certain, but I suspect the translators comment may need to be on a separate line, immediately above the esc_html__ function, to be parsed correctly.


$token .= '>';
$this->tokens[] = $token;
$this->translators_note .= '%' . $this->increment . "\$s is the start of a '" . $token_name . "' HTML element";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any way of differentiating self closing tags? For example "%3$s is the start of a 'br' HTML element" might be a little confusing, since it doesn't (usually) have a closing tag.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants