Render integer values in <sup> simply #590

Closed
ietf-svn-bot opened this issue Feb 2, 2021 · 33 comments

Comments

@ietf-svn-bot

owner:jennifer@painless-security.com resolution_fixed type_enhancement | by martin.thomson@gmail.com


There are a number of places in QUIC where we use 2^15 and similar. Using 2<sup>15</sup> makes the HTML rendering much nicer, but the text then renders as 2^(15).

A small tweak might improve rendering with no real loss of fidelity. Patch inbound.


Issue migrated from trac:590 at 2022-02-08 07:12:21 +0000

@ietf-svn-bot

@rjsparks@nostrum.com commented


Thanks for the patch.
I think we should expand on the regex to match any single token, not just an integer.
See also #574.

@ietf-svn-bot

@martin.thomson@gmail.com uploaded file 0001-Render-integer-superscripts-simply.patch (1.4 KiB)

Render integer subscripts simply

@ietf-svn-bot

@martin.thomson@gmail.com commented


Ahh, I didn't see that one.

I can easily change the pattern matching here, but it's not clear what the rules would be for deciding. Are you thinking ^(?:-?\d+|\w+)$? That would capture integers or single "words", according to regex definitions.
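
For reference, a quick check of what that candidate pattern accepts. This is a minimal sketch using Python's `re` module; the sample strings are chosen here purely for illustration and do not come from any draft.

```python
import re

# Candidate pattern from the comment above: a (possibly negative) integer,
# or a single run of "word" characters.
CANDIDATE = re.compile(r'^(?:-?\d+|\w+)$')

# Illustrative superscript/subscript contents (assumed, not from a real draft).
samples = ['15', '-15', 'max', 'max_0', '3.0', 'x+y', '']

# Map each sample to whether the pattern would let it drop its parentheses.
results = {s: CANDIDATE.match(s) is not None for s in samples}

for s, ok in results.items():
    print(repr(s), 'no parentheses' if ok else 'keep parentheses')
```

Note that `max_0` is accepted, because `_` counts as a word character in `\w`.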

@ietf-svn-bot

@lars@eggert.org commented


Could we make the same change to <sub>? It would improve rendering of the subscripts of the math variables in https://ntap.github.io/rfc8312bis/draft-eggert-tcpm-rfc8312bis.txt

@ietf-svn-bot

@martin.thomson@gmail.com uploaded file 0001-Render-integer-super-sub-scripts-simply.patch (2.2 KiB)

Simple rendering for super- and sub-scripts

@ietf-svn-bot

@jennifer@painless-security.com changed status from new to assigned

@ietf-svn-bot

@jennifer@painless-security.com changed owner from `` to jennifer@painless-security.com

@ietf-svn-bot

@jennifer@painless-security.com commented


One exceptional case that jumps out at me is where someone explicitly wants parentheses in the HTML output, e.g., 2<sup>(x + y)</sup>. This will render as 2^((x+y)). The same would apply to any other kind of bracket.

As this is a problem with the current text renderer and solving this as a general problem is tricky, perhaps that's best left as a separate issue. (Or put aside entirely.)
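
To make the doubling concrete, here is a minimal sketch of a renderer that always wraps the superscript in parentheses. `render_sup_naive` is a hypothetical stand-in written for this illustration, not the actual xml2rfc method.

```python
def render_sup_naive(text):
    """Hypothetical stand-in for the text renderer: always add parentheses."""
    return '^(%s)' % text


# A plain integer picks up parentheses it does not need.
print('2' + render_sup_naive('15'))       # 2^(15)

# Content that already carries parentheses ends up with two pairs.
print('2' + render_sup_naive('(x + y)'))  # 2^((x + y))
```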

@ietf-svn-bot

@jennifer@painless-security.com commented


How would you feel if I simplified the pattern to ^\w+$? This matches integers and words, but keeps the parentheses for signed numbers or decimals. I think punctuation in the super/subscripted expression can be confusing. Between, e.g., 2^(3.0) and 2^3.0, I find the former to be clearer. Also, 2^(-1) vs 2^-1.
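
A small sketch of how the `^\w+$` rule would behave on those cases. `render_sup` here is a hypothetical helper written for this illustration; the real text writer's method may look different.

```python
import re

# Only a single run of word characters is treated as "simple".
SIMPLE = re.compile(r'^\w+$')


def render_sup(base, text):
    """Hypothetical helper: drop the parentheses only for a single word token."""
    if SIMPLE.match(text):
        return base + '^%s' % text
    return base + '^(%s)' % text


for text in ['15', 'n', '3.0', '-1']:
    print(render_sup('2', text))
# 2^15
# 2^n
# 2^(3.0)
# 2^(-1)
```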

@ietf-svn-bot

@jennifer@painless-security.com commented


Very sorry for the spam, but I just ran the tests with patched code and I'm not enamored of the results:

-   This is regular text.  This is s_(ubscript).  This is s^(uperscript).
+   This is regular text.  This is s_ubscript.  This is s^uperscript.

and

-    | The _quick_ ^(brown) _(fox) *jumps* over    | Paragraph 1       |
+    | The _quick_ ^brown _fox *jumps* over the    | Paragraph 1       |

The examples are contrived, so in actual use things might turn out clearer. Looking at the subscript examples lars@eggert.org pointed to,

  • _W_(max)_ will become _W_max_
  • W_(cubic)(_t_ + _RTT_) will become W_cubic (_t_ + _RTT_)

I wonder if it might be preferable to keep the parentheses except for integers. I'm happy to do it either way, just wanted to point this out to be sure the effect is what's desired.

@ietf-svn-bot

@martin.thomson@gmail.com commented


I find that the W_cubic example is better, but it isn't clear why the values for t and RTT are underlined in that way. Mixing subscripts and other underscores in that way ends up looking odd, but that might be something Lars can work through. Changing the W_max example is probably something Lars can do though.

This works very nicely for the numbers in QUIC. Much better than with the parentheses.

I do think that maybe we could remove '_' from the set of characters that was otherwise in \w to avoid creating confusion in rendering, but otherwise, I think that this is good. I think that authors will simply need to be aware of how this renders in text and adjust. Just like they probably shouldn't mix a literal '^' and ''.
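
One way to express "`\w` without the underscore" in a Python regex is the negated class `[^\W_]`. This is only a sketch of the idea, not necessarily how the patch does it, and the sample strings are illustrative.

```python
import re

# [^\W_] reads as "not (a non-word character or an underscore)",
# i.e. every \w character except '_'.
NO_UNDERSCORE = re.compile(r'^[^\W_]+$')

samples = ['max', '15', 'max_0', '_t_']
accepted = {s: NO_UNDERSCORE.match(s) is not None for s in samples}

for s, ok in accepted.items():
    print(repr(s), ok)
```

With this class, `max` and `15` lose their parentheses while anything containing an underscore keeps them.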

@ietf-svn-bot

@lars@eggert.org commented


It's _W_(max)_ because the markdown source is *W<sub>max</sub>*, i.e., the markdown formats the variable name in the body of the text in italics, to match the styling of the SVG math renderer. Ditto for _t_ and _RTT_: the math renderer uses italics for variables, and I am trying to reproduce that.

I'd prefer that italics in text form didn't get rendered with underscores and instead simply became plain text, but that needs a separate issue filed.

@ietf-svn-bot

@jennifer@painless-security.com commented


This sounds good - makes sense that people will need to be careful, since there's only so much that can be done to typeset things unambiguously. I agree that keeping parentheses if the expression includes an underscore is a good idea.

I think that the pattern ^[+-]?\d*\.?[a-zA-Z0-9]*$ captures what we've discussed.

@ietf-svn-bot

@jennifer@painless-security.com changed status from assigned to closed

@ietf-svn-bot

@jennifer@painless-security.com changed resolution from `` to fixed

@ietf-svn-bot

@jennifer@painless-security.com commented


Fixed in 65f2676:

Simplify text rendering of super/subscripts. Based on patch submitted by martin.thomson@gmail.com. Fixes #590. Commit ready for merge.

@ietf-svn-bot

@martin.thomson@gmail.com commented


Hi Jennifer,

You have:

        return re.match(r'^[+-]?\d*\.?[a-zA-Z0-9]*$', expr) is not None 

I don't think that is good as it allows for some weird patterns. Like '^+.word', '^+', '^23.stuff', or the empty string: '^'. I would have thought that it would be better to keep numbers and words distinct and require at least one character:

        return re.match(r'^(?:[+-]?\d+(?:\.\d+)?|[a-zA-Z0-9]+)$', expr) is not None 

This doesn't allow for an empty digit string in any position for a number, nor does it allow for the string overall to be empty as your pattern did.
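
The difference between the two patterns can be checked directly. The sketch below uses both regexes exactly as quoted; the variable names and the list of cases are made up for this illustration.

```python
import re

# Pattern as committed, and the stricter alternative proposed above.
COMMITTED = re.compile(r'^[+-]?\d*\.?[a-zA-Z0-9]*$')
PROPOSED = re.compile(r'^(?:[+-]?\d+(?:\.\d+)?|[a-zA-Z0-9]+)$')

cases = ['15', '-15', '3.0', 'max', '+.word', '+', '23.stuff', '', '-2n']

# For each case: (accepted by committed pattern, accepted by proposed pattern)
verdicts = {
    text: (COMMITTED.match(text) is not None, PROPOSED.match(text) is not None)
    for text in cases
}

for text, (old, new) in verdicts.items():
    print(repr(text), 'committed:', old, 'proposed:', new)
```

The committed pattern accepts every case in the list, including the empty string. The proposed one accepts only `15`, `-15`, `3.0`, and `max`; note that it also rejects a signed mixed token such as `-2n`.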

Not using \w means that this loses the ability to have a unicode character in super-/sub-script, which is probably worth noting.
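
A short sketch of that trade-off, assuming Python 3, where `\w` in a `str` pattern matches Unicode word characters by default. The Greek samples are illustrative.

```python
import re

WORD = re.compile(r'^\w+$')               # Unicode-aware for str patterns
ASCII_ONLY = re.compile(r'^[a-zA-Z0-9]+$')

alpha = '\u03b1'         # Greek small letter alpha
beta2 = '\u03b2' + '2'   # Greek small letter beta followed by a digit

for text in ['n', alpha, beta2]:
    print(text, WORD.match(text) is not None, ASCII_ONLY.match(text) is not None)
```

Both Greek samples match `\w+` but not the explicit ASCII class, so with the ASCII class they would keep their parentheses.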

@ietf-svn-bot

@jennifer@painless-security.com commented


Yes, the empty string should be rejected.

The other examples are wonky, but seem contrived. If someone is using notation like that, adding parentheses to the mix is as likely to confuse the meaning as to clarify it. The reason I accept those is that the pattern also accepts things like 2^-2n without parentheses. That seems to me less in need of parentheses than, e.g., 2^-3.14159.

So I think we should perhaps take a step back, decide what we would like to accept as a token first, then implement to that.

A few cases that have come up - I'd appreciate your thoughts on these or any I've overlooked.

Ones we seem to agree clearly do not need parentheses:

  • integers
  • non-integer decimals (at least one digit on either side of the decimal point)
  • ASCII letter/digit strings
  • positive and negative signs on numeric values

Things that may or may not need parentheses (but we don't clearly agree):

  • non-integer decimals (one side or empty, e.g. <sup>.5</sup> or <sup>1.</sup>)
  • non-integer decimals with letter/digit strings (<sup>0.5x</sup>)
  • positive and negative signs for letter/digit strings (<sup>-num</sup> or <sup>+x</sup>)
  • unicode \w strings

Things we seem to agree clearly do need parentheses:

  • anything with non-\w characters
  • anything with an underscore

Regarding unicode, I'm inclined to keep the parentheses - I'm not sure that there's a good way to know that a character is going to be confusing without them, so it seems prudent to assume the worst. It might be nice to handle common cases, such as Greek characters, but that seems like a big project to handle well.

For decimal points without digits on one side, my inclination is to keep them. They're poor style, but I don't know that they are any less readable without the parentheses. I don't feel terribly strongly about this, though.

I do think accepting signs for things like <sup>-2n</sup> is desirable.

Sorry for the long message - I don't mean to draw this out, but it's a tricky feature and I think being deliberate will avoid revisiting it more than necessary.

@ietf-svn-bot

@martin.thomson@gmail.com commented


Thanks Jennifer, that makes sense. On your questionable ones:

non-integer decimals (one side or empty, e.g. .5 or 1.)

Prefer parens, I think, but only weakly.

non-integer decimals with letter/digit strings (0.5x)

Prefer no parens, yeah.

positive and negative signs for letter/digit strings (-num or +x)

Prefer no parens on -, don't care about + (it's weird, so I'm OK either way).

unicode \w strings

Prefer no parens; we could just filter out underscore. The reason is to deal with the math stuff Lars is doing, where 2^α seems pretty reasonable.

Does that help?

@ietf-svn-bot

ietf-svn-bot commented Feb 11, 2021

@lars@eggert.org commented


Replying to ietf-svn-conversion/xml2rfc#590 (comment:13):

Regarding unicode, I'm inclined to keep the parentheses - I'm not sure that there's a good way to know that a character is going to be confusing without them, so it seems prudent to assume the worst. It might be nice to handle common cases, such as Greek characters, but that seems like a big project to handle well.

Given that sub/sup are almost always going to be used for math, I think allowing "mathy" Unicode characters such as Greek letters would be very useful.

@ietf-svn-bot

@jennifer@painless-security.com commented


Thanks for your thoughts. I'm sold on parenthesizing the bare decimal points and on accepting unicode words. I think accepting plus signs is worthwhile - it's not common, but comes up sometimes and basing the rule on its being a sign character seems to me less likely to be surprising.

Rather than trying to write all this in a RE pattern, I've expanded the is_simple_expression() method - I think this makes it more understandable. I've made it unicode-aware, but have not found a way to enter unicode characters that are rendered by the <sup> (they turn into "&#" code points when I try the straightforward way).

I have added a check that avoids doubling up if the expression is already delimited by parentheses (so that <sup>(x+y)</sup> won't become ^((x+y)))

    def is_simple_expression(expr):
        """Can this expression be rendered without adding parentheses?"""
        def already_parenthesized(s):
            """Is the string enclosed in parentheses?

            Only considers parentheses, not other brackets. Good enough to avoid
            pointlessly doubling the parentheses, not to decide that the expression
            makes mathematical sense.
            """
            if not (len(s) >= 2 and s[0] == '(' and s[-1] == ')'):
                return False
            count = 0
            for c in s[1:-1]:
                count += 1 if c == '(' else -1 if c == ')' else 0
                if count < 0:
                    return False
            return count == 0

        expr = expr.strip()

        # Avoid (( )) if the entire expression is already in balanced parentheses
        if already_parenthesized(expr):
            return True

        # Underscore is a `\w` character, so explicitly reject it
        if '_' in expr:
            return False

        # Leading sign is allowed, so ignore it for further tests. Accept unicode
        # sign chars '\u2212' (negative sign), '\u00b1' (plus/minus), '\u2213' (minus/plus),
        # '\ufe63' (small minus),'\uff0b' (full-width plus), '\uff0d' (full-width minus)
        if expr and expr[0] in '+-\u2212\u00b1\u2213\ufe63\uff0b\uff0d':
            expr = expr[1:]

        # Empty or all-whitespace after removing sign must have parentheses for clarity
        if len(expr) == 0:
            return False

        # Regex accepts possibly decimal number followed by mixed word characters.
        # Assumes already removed sign and checked for empty string.
        return re.match(r'^(?:\d+(?:\.\d+)?)?\w*$', expr) is not None

To give you an idea of what this does, for the following input

          <t>2<sup>15</sup> 2<sup>-15</sup><sup>+15</sup></t>
          <t>2<sup>3.0</sup> 2<sup>-3.0</sup> 2<sup>+3.0</sup></t>
          <t>2<sup>(x+y)</sup> 2<sup>-(x+y)</sup></t>
          <t>2<sup>2n</sup> 2<sup>-2n</sup></t>
          <t>this is s<sup>uperscript</sup></t>
          <t>this is s<sup>-trange</sup></t>
          <t>this is <sup>multiple words</sup></t>
          <t>W<sub>max</sub></t> <t>W<sub>max_0</sub></t>
          <t><sup>+.word</sup> <sup>23.stuff</sup> <sup></sup> <sup>   </sup> <sup>-</sup></t>

it renders to

   2^15 2^-15^+15

   2^3.0 2^-3.0 2^+3.0

   2^(x+y) 2^(-(x+y))

   2^2n 2^-2n

   this is s^uperscript

   this is s^-trange

   this is ^(multiple words)

   W_max

   W_(max_0)

   ^(+.word) ^(23.stuff) ^() ^() ^(-)
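
The listing above can be cross-checked with a self-contained copy of the function. This is a sketch: the logic follows the code posted in this comment (written with `==` comparisons), while the `SIGNS` constant, the module-level layout, and the `CASES` table are arrangements made for this illustration. It checks only the yes/no decision, not the writer's whitespace handling.

```python
import re

# Leading sign characters accepted by the posted code.
SIGNS = '+-\u2212\u00b1\u2213\ufe63\uff0b\uff0d'


def already_parenthesized(s):
    """True if s is enclosed in one balanced pair of parentheses."""
    if not (len(s) >= 2 and s[0] == '(' and s[-1] == ')'):
        return False
    count = 0
    for c in s[1:-1]:
        count += 1 if c == '(' else -1 if c == ')' else 0
        if count < 0:
            return False
    return count == 0


def is_simple_expression(expr):
    """Can this expression be rendered without adding parentheses?"""
    expr = expr.strip()
    if already_parenthesized(expr):
        return True
    if '_' in expr:
        return False
    if expr and expr[0] in SIGNS:
        expr = expr[1:]
    if len(expr) == 0:
        return False
    return re.match(r'^(?:\d+(?:\.\d+)?)?\w*$', expr) is not None


# (content of <sup>/<sub>, rendered without added parentheses?) per the listing
CASES = [
    ('15', True), ('-15', True), ('+15', True),
    ('3.0', True), ('-3.0', True), ('+3.0', True),
    ('(x+y)', True), ('-(x+y)', False),
    ('2n', True), ('-2n', True),
    ('uperscript', True), ('-trange', True), ('multiple words', False),
    ('max', True), ('max_0', False),
    ('+.word', False), ('23.stuff', False), ('', False), ('   ', False), ('-', False),
]

for text, expected in CASES:
    print(repr(text), is_simple_expression(text), expected)
```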

What do you think?

@ietf-svn-bot

@martin.thomson@gmail.com commented


Love it. Thanks for doing this.

Given the leading +/- check, why not this ordering?

        # Leading sign is allowed, so ignore it for further tests. Accept unicode
        # sign chars '\u2212' (negative sign), '\u00b1' (plus/minus), '\u2213' (minus/plus),
        # '\ufe63' (small minus),'\uff0b' (full-width plus), '\uff0d' (full-width minus)
        if expr and expr[0] in '+-\u2212\u00b1\u2213\ufe63\uff0b\uff0d':
            expr = expr[1:]

        # Avoid (( )) if the entire expression is already in balanced parentheses
        if already_parenthesized(expr):
            return True

        # Underscore is a `\w` character, so explicitly reject it
        if '_' in expr:
            return False

        # Empty or all-whitespace after removing sign must have parentheses for clarity
        if len(expr) == 0:
            return False

That would change 2<sup>-(x+y)</sup> to 2^-(x+y).
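
A sketch of the effect of that reordering, with the sign stripped before the parenthesis check. `is_simple_sign_first` and `render_sup` are hypothetical names used only here, and the version that was finally committed may differ.

```python
import re

SIGNS = '+-\u2212\u00b1\u2213\ufe63\uff0b\uff0d'


def already_parenthesized(s):
    """True if s is enclosed in one balanced pair of parentheses."""
    if not (len(s) >= 2 and s[0] == '(' and s[-1] == ')'):
        return False
    depth = 0
    for c in s[1:-1]:
        depth += 1 if c == '(' else -1 if c == ')' else 0
        if depth < 0:
            return False
    return depth == 0


def is_simple_sign_first(expr):
    """Variant of the check with the leading sign removed first."""
    expr = expr.strip()
    if expr and expr[0] in SIGNS:
        expr = expr[1:]
    if already_parenthesized(expr):
        return True
    if '_' in expr or len(expr) == 0:
        return False
    return re.match(r'^(?:\d+(?:\.\d+)?)?\w*$', expr) is not None


def render_sup(text):
    """Hypothetical wrapper: add parentheses only when the check says so."""
    return '^' + text if is_simple_sign_first(text) else '^(%s)' % text


print('2' + render_sup('-(x+y)'))  # 2^-(x+y)
print('2' + render_sup('(x+y)'))   # 2^(x+y)
print('2' + render_sup('-'))       # 2^(-)
```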

@ietf-svn-bot

@jennifer@painless-security.com commented


I went back and forth on that. I'm happy to do it the other way.

However, one thing I've realized while thinking about that is that we need to think about spaces. The issue:

x<sub>0</sub><sup>n</sup>y<sub>0</sub><sup>m</sup>

becomes

x_0^ny_0^m

which, in addition to looking like strange ASCII art, is pretty ambiguous. I'm not sure how to handle this. The simple thing would be to change the render_sup method to use '^%s ' (note the space after the s), but that will cause artifacts like:

My sentence is x_0 ^n y_0 ^m .

(spaces between sub/sup and before the sentence period)

I suppose this is another case where we could leave it to the author to know that spaces are needed - certainly that'd be understood by LaTeX users.
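
The two outcomes can be reproduced with a toy model. This is only a sketch: `render_sub`, `render_sup`, and `typeset` below are hypothetical helpers for this illustration, and the trailing space is applied to both sub and sup to match the artifact shown above.

```python
def render_sub(text, trailing_space=False):
    """Hypothetical subscript renderer, optionally followed by a space."""
    return '_%s%s' % (text, ' ' if trailing_space else '')


def render_sup(text, trailing_space=False):
    """Hypothetical superscript renderer, optionally followed by a space."""
    return '^%s%s' % (text, ' ' if trailing_space else '')


def typeset(trailing_space):
    """x<sub>0</sub><sup>n</sup>y<sub>0</sub><sup>m</sup> as plain text."""
    ts = trailing_space
    return ('x' + render_sub('0', ts) + render_sup('n', ts) +
            'y' + render_sub('0', ts) + render_sup('m', ts))


print('My sentence is ' + typeset(False) + '.')  # My sentence is x_0^ny_0^m.
print('My sentence is ' + typeset(True) + '.')   # My sentence is x_0 ^n y_0 ^m .
```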

@ietf-svn-bot

@jennifer@painless-security.com commented


Ok - I had a look at the output of the HTML writer and found that its results without a space between factors also look a bit odd. With a space, they are much more readable. Based on that, I'm not going to worry about the lack of a trailing space in the text writer and leave it to the author to insert one.

@ietf-svn-bot

@jennifer@painless-security.com commented


FYI, the additional work has now been committed in 28d2f44

@ietf-svn-bot

@martin.thomson@gmail.com commented


Thanks Jennifer, this is a nice improvement.

@ietf-svn-bot
Copy link
@rjsparks@nostrum.com commented


Fixed in 0979a66:

Merged in 65f2676 and 28d2f44 from jennifer@painless-security.com: Simplify text rendering of super/subscripts. Based on patch submitted by martin.thomson@gmail.com and refinement from subsequent list discussion. Fixes #590.

@ietf-svn-bot

The attachments for these issues were lost in trac before the transition to github, and cannot be recovered. If the issue is still relevant, and the attachments can be reconstructed, please add them as new comments.
