Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

I18n: Add check to detect string wrapped in HTML #1856

Merged
merged 11 commits into from
Feb 25, 2020

Conversation

ockham
Copy link
Contributor

@ockham ockham commented Jan 30, 2020

Fixes the non-controversial part of #1419 by adding a sniff that detects translated strings wrapped in HTML tags, and a corresponding unit test. Quoting #1419 (comment):

Examples where the markup should be removed from inside of the string would be:

<?php __( '<h1>Settings Page</h1>', 'text-domain ); ?>
should be
<h1><?php __( 'Settings Page', 'text-domain' ); ?></h1>
because the markup only wraps the string.

Note that I had to add code to I18nUnitTest.php to reset the text domain as set via setConfigData() for one test file, as it would otherwise persist for every test file after that one.

Based on a WordPress.com-internal check. /cc @akirk

@@ -80,6 +80,8 @@ class I18nSniff extends AbstractFunctionRestrictionsSniff {
[bcdeEfFgGosuxX] # Type specifier.
)/x';

const WRAPPED_IN_HTML_REGEX = '#^(<[^>]+>).*(<[^>]+>)$#';

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also include the forward slash / in the closing tag, so it doesn't match <foo>123<foo>? i.e. ^(<[^>]+>).*(<\/[^>]+>)$

Also, not sure if it matters, but this will also match empty content (e.g. <foo><foo>) or non-alphanumeric tags (e.g. <<>>123<<>). Maybe using \w instead of [^>] would handle these? 🤔

Sorry if I'm just adding noise, only tested briefly here: https://regex101.com/r/rIp4dV/3/tests

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit more noise: here's same tests with \w https://regex101.com/r/rIp4dV/4/tests :D

Copy link

@chriskmnds chriskmnds Jan 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah definitely some noise. HTML5 has optional closing slashes

whoops totally irrelevant comment - HTML5 has optional self closing slashes for void elements i believe. but that's totally irrelevant here :D

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you've pointed out in Slack, it's really hard to parse HTML with a RegEx -- even if that HTML is supposedly simply, there can be nuance such as e.g. an attribute containing >, such as in <address id="%s" data-type=">foo" value="%s">my address</address>

See this famous Stack Overflow post: https://stackoverflow.com/a/1732454

The correct way would probably be to use one of PHP's included HTML parsing facilities, but OTOH, that seems a bit overblown.

Some of these cases are arguably rather pathological, so I'm going to stop tweaking the RegEx for now. I think it might be fine for starters, and it can be extended to include use cases that other people might run into over time.

Copy link

@chriskmnds chriskmnds Jan 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, agree 100%. 😃

See this famous Stack Overflow post: https://stackoverflow.com/a/1732454

Best post on Stack Overflow! 😀

Copy link
Member

@jrfnl jrfnl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ockham Thank you for this PR. The principle of it looks good, the fine details still need some work.

I hope you'll find the feedback I've left useful. Please let me know if you have any questions.

Also: please make sure the build passes on your PR. In this case, comments are missing punctuation, but you can review the Travis output after pushing updates to see the details on why a build is failing (if it fails).

WordPress/Sniffs/WP/I18nSniff.php Show resolved Hide resolved
WordPress/Sniffs/WP/I18nSniff.php Outdated Show resolved Hide resolved
WordPress/Tests/WP/I18nUnitTest.4.inc Outdated Show resolved Hide resolved
// phpcs:set WordPress.WP.I18n text_domain[] my-slug

__( '<address>123 Fake Street</address>', 'my-slug' ); // Bad, string shouldn't be wrapped in HTML.
__( 'I live at <address>123 Fake Street</address>', 'my-slug' ); // Okay, no consensus on HTML tags withing strings.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to see some more unit tests. This feels very minimal for a new feature.

WordPress/Sniffs/WP/I18nSniff.php Outdated Show resolved Hide resolved
WordPress/Sniffs/WP/I18nSniff.php Outdated Show resolved Hide resolved
WordPress/Sniffs/WP/I18nSniff.php Outdated Show resolved Hide resolved
WordPress/Sniffs/WP/I18nSniff.php Outdated Show resolved Hide resolved
@ockham
Copy link
Contributor Author

ockham commented Jan 31, 2020

I think this might be ready for another look 😊

jrfnl
jrfnl previously requested changes Feb 1, 2020
Copy link
Member

@jrfnl jrfnl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ockham Thanks for making those changes. Definitely going in the right direction, though it still needs a little more work. Hope you'll find my remarks useful. Let me know if you have any questions.

Also: once done, could you squash the commits into one ?

WordPress/Sniffs/WP/I18nSniff.php Outdated Show resolved Hide resolved
WordPress/Sniffs/WP/I18nSniff.php Outdated Show resolved Hide resolved
__( 'I live at <address>123 Fake Street</address>', 'my-slug' ); // Okay, no consensus on HTML tags withing strings.
__( '<div class="my-class">Translatable content</div>', 'my-slug' ); // Bad
__( '<option id="%1$s" value="%2$s">Translatable option name</option>', 'my-slug' ); // Wrapping is okay, since there are placeholders

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd still like to see some additional tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added your example and another variation of an existing one, but seem to struggle to come up with meaningful ones overall. Could you give me some pointers as to what you're thinking of?

Copy link
Member

@jrfnl jrfnl Feb 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, try this as a good starting point for finding more tests: try running the sniff over WP core and see which false positives it throws up for this new error code. Those cases will all need to be accounted for and the patterns found which caused the false positives should be covered by unit tests.

Also see the link in the original issue to Core trac where there are also quite some examples.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, try this as a good starting point for finding more tests: try running the sniff over WP core and see which false positives it throws up for this new error code. Those cases will all need to be accounted for and the patterns found which caused the false positives should be covered by unit tests.

Ah yeah, makes sense, thanks. wp-includes doesn't yield any occurrences, wp-content has the following:

https://github.com/WordPress/WordPress/blob/c289bb59acdd3217cdb81e15b580dc3d7cd6a007/wp-content/themes/twentythirteen/image.php#L30

https://github.com/WordPress/WordPress/blob/c289bb59acdd3217cdb81e15b580dc3d7cd6a007/wp-content/themes/twentyeleven/inc/theme-options.php#L121-L122

Those seem legit, but I think we've covered that class of issue with our current unit tests, no?

wp-admin has quite a few, e.g.

https://github.com/WordPress/WordPress/blob/c289bb59acdd3217cdb81e15b580dc3d7cd6a007/wp-admin/comment.php#L56-L57

https://github.com/WordPress/WordPress/blob/c289bb59acdd3217cdb81e15b580dc3d7cd6a007/wp-admin/edit-comments.php#L194-L197

https://github.com/WordPress/WordPress/blob/c289bb59acdd3217cdb81e15b580dc3d7cd6a007/wp-admin/edit-form-advanced.php#L313-L314

and so on.

A lot of these are <a /> links where only the link text needs translating, and the actual <a /> could be moved out of the __() call.

Also see the link in the original issue to Core trac where there are also quite some examples.

Uh, not sure which issue you mean 😅 Maybe this list (which #1419 (comment) refers to)? I checked a few of those.

<span>Previewing:</span> %s
Plugin <strong>resumed</strong>.
<span class="screen-reader-text">link to view ratings opens in a new tab</span>

These also seem pretty much covered I'd say 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of these are links where only the link text needs translating, and the actual could be moved out of the __() call.

Sorry, but that's not true. Those are legit translation calls including the tag as there may be a localized page available for that URL and part of translating is changing the URL to the localized version.

I think it would be a good idea to allow for those, i.e. if the tag is an a AND has a href with a hard-coded URL, there should be no warning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of these are links where only the link text needs translating, and the actual could be moved out of the __() call.

Sorry, but that's not true. Those are legit translation calls including the tag as there may be a localized page available for that URL and part of translating is changing the URL to the localized version.

Right, I didn't think of that.

I think it would be a good idea to allow for those, i.e. if the tag is an a AND has a href with a hard-coded URL, there should be no warning.

I'm not totally sure it should be generalized that much though? This seems like a fairly 'semantic' addition to the sniff that implies quite some background information (there can be localized link targets), which might not be true for all codebases alike. One could argue that the localized link targets are more the exception than the rule, that the sniff doesn't necessarily need to know about them, and that it'd be fine to require a sniff ignore to be added by the author in case it's needed 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the warning would be kept for link tags with hard-coded URLs, I'd like to see a separate error code for those.
That would at least allow codebases - including WP Core - which do have localized URLs and therefore do this with good reason to exclude the warning for those particular text strings from their custom ruleset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added per 1294413

@jrfnl
Copy link
Member

jrfnl commented Feb 1, 2020

I was just talking about this PR with a friend of mine and he suggested (rightly) to also consider the following case for which the HTML tags should not be removed:

__( '<span>Text</span> More text <span>Text</span>', 'domain' );

@ockham
Copy link
Contributor Author

ockham commented Feb 3, 2020

Thanks, that's all valuable feedback.

However, I'm starting to think that using RegExes here is an uphill battle, as discussed here.

I'll give one of PHP's integrated HTML parsers a shot -- XMLReader seems to have the smallest memory, resource-wise.

@ockham ockham force-pushed the add/i18n-html-tag-sniff branch 2 times, most recently from d689d4c to c766d3c Compare February 3, 2020 12:25
Fixes the non-controversial part of WordPress#1419 by adding a sniff that detects translated strings wrapped in HTML tags, and a corresponding unit test. Quoting WordPress#1419 (comment):

> Examples where the markup _should_ be removed from inside of the string would be:
>
> ```
> <?php __( '<h1>Settings Page</h1>', 'text-domain ); ?>
> should be
> <h1><?php __( 'Settings Page', 'text-domain' ); ?></h1>
> because the markup only wraps the string.
> ```

Note that I had to add code to `I18nUnitTest.php` to reset the text domain as set via `setConfigData()` for one test file, as it would otherwise persist for every test file after that one.
@ockham
Copy link
Contributor Author

ockham commented Feb 3, 2020

Rewrote the sniff using XMLReader (which seems much more precise than the RegExps could ever be for this purpose), added two more test cases (including this), and squashed the commits.

Also ran the sniff on the WordPress codebase (results are here), which seems to indicate no false positives so far, and that our current unit tests are at least representative of the kind of code encountered in a real-world codebase.

Think this is in good enough shape to be considered for merging, and where potential refinements could be addressed in follow-up PRs? 😅

@jrfnl
Copy link
Member

jrfnl commented Feb 3, 2020

Think this is in good enough shape to be considered for merging, and where potential refinements could be addressed in follow-up PRs? 😅

I'll have another look tomorrow but this PR will definitely not go into WPCS 2.2.1, which is due to be released tomorrow. I'd want this feature to be in place for beta testing for at least a little while so we can see if any more issues are reported.

@ockham ockham requested a review from jrfnl February 3, 2020 15:24
@jrfnl jrfnl changed the title I18n: Add sniff to detect string wrapped in HTML I18n: Add check to detect string wrapped in HTML Feb 4, 2020
@ockham
Copy link
Contributor Author

ockham commented Feb 6, 2020

👋 @jrfnl Would you mind giving this another look? 🙂

@jrfnl
Copy link
Member

jrfnl commented Feb 6, 2020

@ockham Has my remark in #1856 (comment) been addressed ?

@jrfnl
Copy link
Member

jrfnl commented Feb 7, 2020

Oh and please also include some tests with malformed/incomplete HTML and verify that the code handles this gracefully. I seem to remember that XMLreader does not handle that well.

Some examples of the top of my head (please try and find some more/more specific ones for known issues):

__( '<div>text to translate', 'domain' );
__( 'text to <div>translate', 'domain' );
__( 'text < to translate', 'domain' );
__( 'text to > translate', 'domain' );

@ockham
Copy link
Contributor Author

ockham commented Feb 7, 2020

I added a few commits to add your malformed HTML examples, plus some that Christos had in his regex tests. I also added a rule for <a href="...">...</a>. Please give it another look!

@jrfnl
Copy link
Member

jrfnl commented Feb 14, 2020

I'd like to get some more eyes on this. @dingo-d @GaryJones Would you please have a look ?

@dingo-d
Copy link
Member

dingo-d commented Feb 15, 2020

Looks good just by looking at it, I have started to run the i18n sniff on themes repo that I have locally (this will take a while), just to see if anything interesting pops up, since themes can get creative when it comes to what they add in the translation strings 😄

@dingo-d
Copy link
Member

dingo-d commented Feb 16, 2020

I ran this over my local themes repo and from what I see it caught everything correctly

https://gist.github.com/dingo-d/aecd13368ec1a91e1bd40772d0d94a21

I ran over the themes using

vendor/bin/phpcs --sniffs=WordPress.WP.i18n ../WordPress-Themes/themes/ --report-file=../phpcs-result.txt --error-severity=0 --extensions=php --report=code

This caught both NoHtmlWrappedStrings and MismatchedPlaceholders warnings, so there is a lot of information in the results.


// Strings wrapped in '<a href="...">...</a>'. These get a different warning, as the link target might actually be localized.
// phpcs:disable WordPress.WP.I18n.NoHtmlWrappedStrings
__( '<a href="http://wordpress.org">WordPress</a>', 'my-slug' ); // Bad, strings shouldn't be wrapped in <a href="...">...</a>.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the one that would concern me. By adding a warning, we're effectively recommending that URLs shouldn't be localized, especially when the comments say that this is "bad".

I'd rather see this one as good, and the placeholder test below possibly as bad.

Copy link
Contributor Author

@ockham ockham Feb 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See this comment, and the one below.

I ended up adding a separate warning for this case (NoAHrefWrappedStrings vs NoHtmlWrappedStrings) so that users would be able to separately disable the <a href=".">...</a> rule (while retaining the more 'generic' one about wrapping strings in HTML tags), to effectively get the behavior that you're describing. Does that sound okay?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had read that, and Juliette has the same concerns that I do.

My concern, is that the person writing the code, may not know of any localised link, but is being a good citizen by making the URL available for translation. Yet, the code in this PR is penalising them for that.

If instead of:

__( '<a href="http://wordpress.org">WordPress</a>', 'my-slug' );

It was:

__( '<a href="http://wordpress.org">WordPress</a> Rocks!', 'my-slug' );

then this sniff would have no issue with the hard-coded URL.

I'm all for the rest, and I know this currently has a different violation code, but right now, I'd rather let any strings wrapped in <a ...>...</a> to not be flagged, regardless of attribute values or placeholders, and then maybe tighten it up in the future if needed.

Copy link
Member

@akirk akirk Feb 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the one that would concern me. By adding a warning, we're effectively recommending that URLs shouldn't be localized, especially when the comments say that this is "bad".

This is incorrect. The recommended approach would be this: <a href="<?php echo esc_url( __( 'https://wordpress.org/' ) ); ?>"><?php _e( 'WordPress Rocks' ); ?></a>, and not "not localizing" the URL.

In the approach you recommend (including the <a> tag incl the href value), translation systems like GlotPress (on translate.wordpress.org) will actually throw a warning if you want to submit a translation with a changed tag (and changing the URL in a tag is changing the tag), so I'd consider this discouraged as well.

This is why I think it's preferrable to wrap the URL separately in gettext and have the sniff advise so (i.e. change the message Strings should not be wrapped in <a href="...">...</a> to something like Strings should not be wrapped in <a href="...">...</a>, please wrap the URL separately).

Copy link
Member

@GaryJones GaryJones Feb 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The recommended approach would be this: <a href="<?php echo esc_url( __( 'https://wordpress.org/' ) ); ?>"><?php _e( 'WordPress Rocks' ); ?></a>, and not "not localizing" the URL.

Your example doesn't match what I put. In mine, WordPress was linked, while Rocks was not. By following your approach, the string would be split.

https://developer.wordpress.org/apis/handbook/internationalization/internationalization-guidelines/#best-practices-for-writing-strings says "Do not leave URLs for translation, unless they could have a version in another language.".

In the case of my example, I'd want to switch it to https://en-gb.wordpress.org.

An examples where it wouldn't be switched, would be when it points to an internal settings page.

At this stage, I wouldn't want to get into analysing the URL itself to decide whether to flag it or not, and so, would still like to see this not flagged for this first pass. With more discussion, and consensus, maybe a variant of this check for a href could be added.

translation systems like GlotPress (on translate.wordpress.org) will actually throw a warning if you want to submit a translation with a changed tag

Correct, along with warnings for other validation failures as well - that's just a way to highlight it to Translation Editors, rather than let potentially spam URLs to make it through to production translation sets. It's not to discourage valid localised URLs from being submitted.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your example doesn't match what I put. In mine, WordPress was linked, while Rocks was not. By following your approach, the string would be split.

This was for the sake of an argument. The whole point of this PR is to encourage leaving out as much as HTML possbile from Gettext string. Your string of argument is distracting from this approach.

Correct, along with warnings for other validation failures as well - that's just a way to highlight it to Translation Editors, rather than let potentially spam URLs to make it through to production translation sets. It's not to discourage valid localised URLs from being submitted.

We can disagree about the reasons behind the validation failure. In general your approach encourages translators to change HTML tags which in my opinion should not be their business and they should actually be protected. It's easy to introduce invalid HTML and just dismiss the warning. So actually the fact that you can override warnings in GlotPress is not helpful in this case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole point of this PR is to encourage leaving out as much as HTML possbile from Gettext string.

Your definition of "as possible" and mine clearly differ. I'll refer to my previous comment on the issue as for why.

In general your approach encourages translators to change HTML tags which in my opinion should not be their business and they should actually be protected.

Your opinion disagrees with the gettext manual, WP best practices document, and one or two handbooks.

Besides that, the prior art here is that the translators using GlotPress have been handling (wrapping) markup in strings from developers who didn't know better for a number of years now. That, on its own, isn't necessarily reason enough to continue to allow it, but I can't see what major issue has happened that would warrant the overly strict set of checks that you want to see imposed in the first pass.

As I've said, I'm happy to continue this discussion for wrapping anchor links, but by moving it outside of this PR, we can unblock the majority of the benefit that the rest of it brings.

WordPress/Tests/WP/I18nUnitTest.1.inc Outdated Show resolved Hide resolved
Copy link
Member

@GaryJones GaryJones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting that strings wrapped with anchor tags are not flagged, for this first improvement of the sniff.

@ockham
Copy link
Contributor Author

ockham commented Feb 19, 2020

Removed the NoAHrefWrappedStrings rule; strings wrapped in <a href="...">...</a> now pass the check.

Copy link
Member

@GaryJones GaryJones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ockham Thanks for making those changes.

@jrfnl
Copy link
Member

jrfnl commented Feb 24, 2020

Sorry I've been quiet over the last week. I'm in full conference prep mode and will be traveling for the next two weeks, so won't have a chance to have another look until after that time.

Though I haven't reviewed the latest version of the technical implementation, I do believe my concerns about the actual functionality have been addressed.

I'll leave it up to the other committers to decide whether they feel confident merging this as it is now or whether they want to wait until I'm back. They have my blessing.

@GaryJones GaryJones dismissed jrfnl’s stale review February 25, 2020 17:57

Dismissing so merge can happen.

@GaryJones GaryJones merged commit 1981c66 into WordPress:develop Feb 25, 2020
@ockham ockham deleted the add/i18n-html-tag-sniff branch February 26, 2020 10:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants