What should mf2 textContent parsing result in? User expectation vs. DOM specification. #15

Open
Zegnat opened this issue Jan 12, 2018 · 15 comments

Comments

@Zegnat
Member

Zegnat commented Jan 12, 2018

Summary

At several points the parsing specification says to return the textContent, but it never defines what this means. I personally always assumed the DOM textContent property for the current element, but this does not seem to match with what parsers have been doing.

Discussion

@aaronpk wrote a blogpost today containing the following, emphasis mine:

I think my only solution for this is going to be to create my own plaintext value out of the sanitized HTML. Unfortunately, that is not a straightforward process, as demonstrated by this relatively long function that does this in the PHP parser. However that might be the technically better option anyway, since XRay can’t be sure exactly what method was used to generate the plaintext value from the original HTML anyway.

I replied to the emphasised statement in chat:

DOM’s textContent should be used, IIRC, else the parser is broken.

This started a discussion in the #indieweb-dev chat that is best read in the chat logs. The discussion continued in the #microformats chat. The important takeaway is that the PHP parser includes its own text extraction implementation, added after a user filed an issue about expected white space missing from the output.

It turned out that the JavaScript parser (glennjones/microformat-shiv) was already doing something like that.

The important part here is user expectation. The user who opened the issue on the PHP parser was expecting to see a line break in the plain text value where a <br> used to be. It is also what aaronpk would expect. From chat:

no, I would definitely expect newlines in the plaintext
given that's how a browser will render it
and if you copypaste the text from the browser it will have newlines

I don’t have any real personal preference. I do feel that the parsing specification should define what it wants, so that compatibility between parsers is guaranteed.

If we end up defining our own textContent algorithm for HTML→plain-text, I do think we should take a good look at what browsers are doing. Especially plain text browsers such as lynx and w3m.

Parser behaviour

Test:

<div class="h-entry"><p>Wow<br><span>This</span></p><p>Is Interesting</p></div>

Tested through microformats.io. Output shortened to only the affected h-entry. Node and Ruby were not available for testing.

PHP

        {
            "type": [
                "h-entry"
            ],
            "properties": {
                "name": [
                    "Wow\nThis Is Interesting"
                ]
            }
        }

Python

  {
   "type": [
    "h-entry"
   ], 
   "properties": {
    "name": [
     "WowThisIs Interesting"
    ]
   }
  }

Go

    {
      "type": [
        "h-entry"
      ],
      "properties": {
        "name": [
          "WowThisIs Interesting"
        ]
      }
    }
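
For reference, the "WowThisIs Interesting" result is what plain DOM textContent gives: every text node concatenated in document order, with <br> and <p> contributing nothing. A minimal sketch in Python, assuming the standard library's html.parser stands in for a DOM (the names TextContent and text_content are made up for this illustration and are not taken from any of the parsers tested):

```python
from html.parser import HTMLParser


class TextContent(HTMLParser):
    """Collect every text node in document order, ignoring all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Text nodes are kept verbatim; elements such as <br> add nothing.
        self.parts.append(data)


def text_content(html):
    parser = TextContent()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


html = '<div class="h-entry"><p>Wow<br><span>This</span></p><p>Is Interesting</p></div>'
print(repr(text_content(html)))  # prints 'WowThisIs Interesting'
```

This matches the Python and Go output above, and differs from the PHP output only because php-mf2 runs its own extraction.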
@Zegnat changed the title from “Define the value of textContent.” to “What should mf2 textContent parsing result in? User expectation vs. DOM specification.” Jan 12, 2018
@gRegorLove
Member

gRegorLove commented Jan 12, 2018

Node

via https://glennjones.net/tools/microformats/ ("Experimental ‐ Text white-space collapsing" option not checked, though even when it's checked it does not seem to change with this example.)

{
    "type": ["h-entry"],
    "properties": {
        "name": ["WowThisIs Interesting"]
    }
}

@aaronpk
Member

aaronpk commented Jan 12, 2018

The actual result I would expect is what is rendered by a browser:

Wow
This

Is Interesting

Here it is in Lynx

screenshot 2018-01-12 11 00 17

https://pin13.net/mf2/?id=20180112185913831
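
The browser-like result described above can be approximated by treating <br> as a line break and <p> as a block surrounded by blank lines. This is only a rough sketch under those two assumptions, not the algorithm of any existing parser; RenderedText and rendered_text are hypothetical names:

```python
import re
from html.parser import HTMLParser


class RenderedText(HTMLParser):
    """Very rough 'as rendered' text: <br> is a newline, <p> is a block."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.out.append("\n")
        elif tag == "p":
            self.out.append("\n\n")

    def handle_endtag(self, tag):
        if tag == "p":
            self.out.append("\n\n")

    def handle_data(self, data):
        # Source whitespace collapses to a single space, as in normal flow.
        self.out.append(re.sub(r"[ \t\n\f\r]+", " ", data))


def rendered_text(html):
    parser = RenderedText()
    parser.feed(html)
    parser.close()
    text = "".join(parser.out)
    text = re.sub(r" *\n *", "\n", text)    # drop spaces next to line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)  # adjacent paragraph gaps collapse
    return text.strip()


html = '<div class="h-entry"><p>Wow<br><span>This</span></p><p>Is Interesting</p></div>'
print(repr(rendered_text(html)))  # prints 'Wow\nThis\n\nIs Interesting'
```

Other block-level elements (div, headings, lists) are deliberately ignored here to keep the example short.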

@Zegnat
Member Author

Zegnat commented Mar 23, 2018

It only took 2 months, but I have written a draft specification for handling textContent in the microformats context.

The algorithm combines:

I have also implemented it in JavaScript so it can be live tested and hereby announce I am willing to implement it into php-mf2 ASAP.

Please have a look at the text content from HTML page for the live test and the algorithm.

I will probably be moving it to the microformats wiki Soon™. It can then be linked to from other specs. E.g. #20 could be fixed simply by having the vcp spec point at the text content algorithm for its “innertext”.

@Zegnat
Member Author

Zegnat commented Mar 25, 2018

While implementing this in PHP, I ran into a little snag where a line break was being preserved at the start of the resulting string. So I have updated the algorithm to strip “leading and trailing ASCII whitespace from output” instead of removing “any leading and trailing U+0020 SPACE code points from output.”

I am now throwing more tests at it to see if I should just use ASCII whitespace more often than limiting actions to just spaces etc.
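
The difference between the two trimming rules is easy to see on a string that starts with a line break. A small sketch, assuming "ASCII whitespace" means the WHATWG Infra set (TAB, LF, FF, CR, SPACE); the function names are invented for this example:

```python
# ASCII whitespace as defined by the WHATWG Infra Standard:
# U+0009 TAB, U+000A LF, U+000C FF, U+000D CR, U+0020 SPACE.
ASCII_WHITESPACE = "\t\n\x0c\r "


def trim_spaces_only(text):
    """Old wording: remove leading and trailing U+0020 SPACE only."""
    return text.strip(" ")


def trim_ascii_whitespace(text):
    """Updated wording: remove leading and trailing ASCII whitespace."""
    return text.strip(ASCII_WHITESPACE)


sample = "\nWow\nThis "
print(repr(trim_spaces_only(sample)))       # prints '\nWow\nThis'
print(repr(trim_ascii_whitespace(sample)))  # prints 'Wow\nThis'
```

Only the second rule removes the leading line break; line breaks inside the string are untouched by either.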

@Zegnat
Member Author

Zegnat commented Mar 29, 2018

Thought: @sknebel just wondered if this should be tweaked so whitespace is not collapsed within PRE elements. I wonder what the user expectation is there.

@kartikprabhu
Member

How does one deal with <pre> elements?
Example:

<article class="h-entry">
    <div class="e-content">
    <p>Hello<br>
    World</p>
    <pre>
    this is some pre formatted text

	this is more pre formatting
    </pre>
    </div>
</article>

what should be the content>value?

@Zegnat
Member Author

Zegnat commented Apr 3, 2018

According to my browser’s innerText method the plain text of e-content in that example is:

Hello
World

    this is some pre formatted text

	this is more pre formatting
    

Or the following after JSON.stringify (to clearly show whitespace):

"Hello\nWorld\n\n    this is some pre formatted text\n\n\tthis is more pre formatting\n    "

@jgarber623
Member

Adding a link to issue #83 on indieweb/microformats-ruby filed by @aaronpk back in March of this year, which is related to whitespace parsing.

@snarfed
Member

snarfed commented Jul 21, 2018

@Zegnat this is great! what's the status? time to move this to the microformats wiki?

@gRegorLove
Member

gRegorLove commented Jul 25, 2018

In #microformats today there was some discussion about <pre> and whitespace. The results of this algorithm don't match with @aaronpk's blankspace-test #11 (from what I can tell that test came after the algorithm).

@snarfed's example as parsed by latest php-mf2:

<div class="h-entry">
  <div class="e-content p-name">
    Hello World
    <pre>
      one
      two
      three
    </pre>
  </div>
</div>

"items": [
    {
        "type": [
            "h-entry"
        ],
        "properties": {
            "name": [
                "Hello World one two three"
            ],
            "content": [
                {
                    "html": "Hello World\n    <pre>\n      one\n      two\n      three\n    <\/pre>",
                    "value": "Hello World one two three"
                }
            ]
        }
    }
]

I don't have a strong opinion about what's "right" here, but at a glance that p-name and e-content.value without newlines and tabs looks nice from a (hypothetical) consumer perspective.

@snarfed
Member

snarfed commented Jul 25, 2018

hrm. overriding pre in name might be ok, but almost certainly not in content.value. whitespace inside pre is meaningful the same way tags like br are meaningful, and need to be preserved, as lots of people have argued both here and in applications like bridgy.

@sknebel
Member

sknebel commented Jul 25, 2018

Whitespace in <pre> IMHO has to be preserved, with the exception of stripping newlines at the start if it is at the start of the property value and stripping whitespace at the end if it is at the end.

Copying comment I made on a PR regarding <pre>:

I believe there should be a newline before and after <pre>, because <pre> is styled as display: block, both by default in browsers and in the use cases I can think of in actual posts (code, ASCII art), meaning it does not stand in line with other content. From my testing, these newlines get collapsed if the <pre> content already starts/ends with one.
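
One possible reading of this proposal, as a sketch only: whitespace outside <pre> collapses, whitespace inside <pre> is kept, <pre> gets a line break before and after unless one is already there, and the ends of the value are trimmed. The class and function names are hypothetical, the single newline right after the <pre> start tag is dropped the way HTML parsing does, and this is not what php-mf2 or any other parser currently implements:

```python
import re
from html.parser import HTMLParser

ASCII_WHITESPACE = "\t\n\x0c\r "


class PreAwareText(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = ""
        self.pre_depth = 0
        self.at_pre_start = False

    def _break_line(self):
        # Add a line break unless the output is empty or already ends with one.
        if self.out and not self.out.endswith("\n"):
            self.out += "\n"

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            if self.pre_depth == 0:
                self.out = self.out.rstrip(" ")  # collapsed space before the block
            self._break_line()
            self.pre_depth += 1
            self.at_pre_start = True
        else:
            self.at_pre_start = False

    def handle_endtag(self, tag):
        if tag == "pre" and self.pre_depth:
            self.pre_depth -= 1
            self._break_line()

    def handle_data(self, data):
        if self.pre_depth:
            # HTML ignores one newline directly after the <pre> start tag.
            if self.at_pre_start and data.startswith("\n"):
                data = data[1:]
            self.at_pre_start = False
            self.out += data  # preserved verbatim
        else:
            data = re.sub(r"[ \t\n\f\r]+", " ", data)
            if not self.out or self.out.endswith("\n"):
                data = data.lstrip(" ")
            self.out += data


def pre_aware_text(html):
    parser = PreAwareText()
    parser.feed(html)
    parser.close()
    # Strip newlines at the start and any ASCII whitespace at the end.
    return parser.out.lstrip("\n").rstrip(ASCII_WHITESPACE)


html = ('<div class="e-content p-name">\n    Hello World\n    <pre>\n'
        '      one\n      two\n      three\n    </pre>\n  </div>')
print(repr(pre_aware_text(html)))
# prints 'Hello World\n      one\n      two\n      three'
```

With this reading, the indentation inside <pre> survives while the "Hello World" line is still collapsed and trimmed.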

@Zegnat
Member Author

Zegnat commented Jul 25, 2018

@snarfed:

@Zegnat this is great! what's the status? time to move this to the microformats wiki?

Yes, I want to move it to the wiki ASAP, and update it with some of @kartikprabhu’s work. So we can do further iteration of the algorithm there, and have its history preserved. I had planned to have it done already, but currently on holiday and internet connectivity has been spotty.

I should be back this coming weekend and will be catching up on all things microformats next week!

@gRegorLove
Member

@snarfed @sknebel Good points. To keep things simple, I would prefer not adding an exception for p-name. +1 for preserving whitespace inside <pre>

sknebel pushed a commit to sknebel-forks/microformats-whitespace-tests that referenced this issue Jul 25, 2018
Per recent discussions (microformats/microformats2-parsing#15 (comment)) and my proposal there to treat <pre> similar to <p>/display: inline-block as in browsers
@Lewiscowles1986

Lewiscowles1986 commented Apr 25, 2020

I'm fairly certain the text element should be calculated before output rather than stored.

Here is an example using Zegnat's tool on a portion of https://html.spec.whatwg.org/multipage/dom.html#the-innertext-idl-attribute

I've imagined a <section> element or <div> as an enclosing context

Screenshot of Firefox display on original site

Screenshot from 2020-04-25 13-25-16

HTML input, with outer div omitted
<h4 id="the-innertext-idl-attribute"><span class="secno">3.2.7</span> The <code id="the-innertext-idl-attribute:dom-innertext"><a href="#dom-innertext">innerText</a></code> IDL attribute<a href="#the-innertext-idl-attribute" class="self-link"></a></h4>
<div class="status"><input onclick="toggleStatus(this)" value="" type="button"><p class="support"><strong>Support:</strong> innertext<span class="and_chr yes"><span>Chrome for Android</span> <span>80+</span></span><span class="chrome yes"><span>Chrome</span> <span>4+</span></span><span class="ios_saf yes"><span>iOS Safari</span> <span>4.0+</span></span><span class="firefox yes"><span>Firefox</span> <span>45+</span></span><span class="safari yes"><span>Safari</span> <span>3.2+</span></span><span class="samsung yes"><span>Samsung Internet</span> <span>4+</span></span><span class="edge yes"><span>Edge</span> <span>12+</span></span><span class="ie yes"><span>IE</span> <span>6+</span></span><span class="and_uc yes"><span>UC Browser for Android</span> <span>12.12+</span></span><span class="opera yes"><span>Opera</span> <span>9.5+</span></span><span class="op_mini yes"><span>Opera Mini</span> <span>all+</span></span><span class="android yes"><span>Android Browser</span> <span>2.3+</span></span></p><p class="caniuse">Source: <a href="https://caniuse.com/#feat=innertext">caniuse.com</a></p></div>
<aside class="mdn-anno wrapped"><button onclick="toggleStatus(this)" class="mdn-anno-btn"><b title="Support in all current engines." class="all-engines-flag"></b><span>MDN</span></button><div class="feature"><p><a href="https://developer.mozilla.org/en-US/docs/Web/API/HTMLElement/innerText" title="The innerText property of the HTMLElement interface represents the &quot;rendered&quot; text content of a node and its descendants.">HTMLElement/innerText</a></p><p class="all-engines-text">Support in all current engines.</p><div class="support"><span class="firefox yes"><span>Firefox</span><span>45+</span></span><span class="safari yes"><span>Safari</span><span>3+</span></span><span class="chrome yes"><span>Chrome</span><span>1+</span></span><hr><span class="opera yes"><span>Opera</span><span>9.6+</span></span><span class="edge_blink yes"><span>Edge</span><span>79+</span></span><hr><span class="edge yes"><span>Edge (Legacy)</span><span>12+</span></span><span class="ie yes"><span>Internet Explorer</span><span>5.5+</span></span><hr><span class="firefox_android yes"><span>Firefox Android</span><span>45+</span></span><span class="safari_ios yes"><span>Safari iOS</span><span>4+</span></span><span class="chrome_android yes"><span>Chrome Android</span><span>18+</span></span><span class="webview_android yes"><span>WebView Android</span><span>4.4+</span></span><span class="samsunginternet_android yes"><span>Samsung Internet</span><span>1.0+</span></span><span class="opera_android yes"><span>Opera Android</span><span>10.1+</span></span></div></div></aside>
<dl class="domintro"><dt><var>element</var> . <code id="dom-innertext-dev"><a href="#dom-innertext">innerText</a></code> [ = <var>value</var> ]</dt><dd>
    <p>Returns the element's text content "as rendered".</p>

    <p>Can be set, to replace the element's children with the given value, but with line breaks
    converted to <code id="the-innertext-idl-attribute:the-br-element"><a href="text-level-semantics.html#the-br-element">br</a></code> elements.</p>
  </dd></dl>
<p>On getting, the <dfn id="dom-innertext"><code>innerText</code></dfn> attribute must follow
  these steps:</p>
<ol><li>
    <p>If this element is not <a id="the-innertext-idl-attribute:being-rendered" href="rendering.html#being-rendered">being rendered</a>, or if the user agent is a non-CSS user
    agent, then return this element's <a id="the-innertext-idl-attribute:descendant-text-content" href="https://dom.spec.whatwg.org/#concept-descendant-text-content" data-x-internal="descendant-text-content">descendant text content</a>.</p>

    <p class="note">This step can produce surprising results, as when the <code id="the-innertext-idl-attribute:dom-innertext-2"><a href="#dom-innertext">innerText</a></code> attribute is accessed on an element not <a id="the-innertext-idl-attribute:being-rendered-2" href="rendering.html#being-rendered">being
    rendered</a>, its text contents are returned, but when accessed on an element that is
    <a id="the-innertext-idl-attribute:being-rendered-3" href="rendering.html#being-rendered">being rendered</a>, all of its children that are not <a id="the-innertext-idl-attribute:being-rendered-4" href="rendering.html#being-rendered">being rendered</a> have
    their text contents ignored.</p>
   </li><li><p>Let <var>results</var> be a new empty <a id="the-innertext-idl-attribute:list" href="https://infra.spec.whatwg.org/#list" data-x-internal="list">list</a>.</p></li><li>
    <p>For each child node <var>node</var> of this element:</p>

    <ol><li><p>Let <var>current</var> be the <a id="the-innertext-idl-attribute:list-2" href="https://infra.spec.whatwg.org/#list" data-x-internal="list">list</a> resulting in running the <a href="#inner-text-collection-steps" id="the-innertext-idl-attribute:inner-text-collection-steps">inner
     text collection steps</a> with <var>node</var>. Each item in <var>results</var> will either
     be a <a id="the-innertext-idl-attribute:string" href="https://infra.spec.whatwg.org/#string" data-x-internal="string">string</a> or a positive integer (a <i>required line break count</i>).</p>

     <p class="note">Intuitively, a <i>required line break count</i> item means that a certain
     number of line breaks appear at that point, but they can be collapsed with the line breaks
     induced by adjacent <i>required line break count</i> items, reminiscent to CSS
     margin-collapsing.</p>

     </li><li><p>For each item <var>item</var> in <var>current</var>, append <var>item</var> to
     <var>results</var>.</p></li></ol>
   </li><li><p><a href="https://infra.spec.whatwg.org/#list-remove" id="the-innertext-idl-attribute:list-remove" data-x-internal="list-remove">Remove</a> any items from <var>results</var> that are the
   empty string.</p></li><li><p><a href="https://infra.spec.whatwg.org/#list-remove" id="the-innertext-idl-attribute:list-remove-2" data-x-internal="list-remove">Remove</a> any runs of consecutive <i>required line break
   count</i> items at the start or end of <var>results</var>.</p></li><li><p><a href="https://infra.spec.whatwg.org/#list-remove" id="the-innertext-idl-attribute:list-replace" data-x-internal="list-replace">Replace</a> each remaining run of consecutive <i>required
   line break count</i> items with a string consisting of as many U+000A LINE FEED (LF) characters
   as the maximum of the values in the <i>required line break count</i> items.</p></li><li><p>Return the concatenation of the string items in <var>results</var>.</p></li></ol>
RAW JSON encoded text output
3.2.7 The innerText IDL attribute\nSupport: innertextChrome for Android 80+Chrome 4+iOS Safari 4.0+Firefox 45+Safari 3.2+Samsung Internet 4+Edge 12+IE 6+UC Browser for Android 12.12+Opera 9.5+Opera Mini all+Android Browser 2.3+\nSource: caniuse.com ✔MDN\nHTMLElement/innerText\nSupport in all current engines.Firefox45+Safari3+Chrome1+Opera9.6+Edge79+Edge (Legacy)12+Internet Explorer5.5+Firefox Android45+Safari iOS4+Chrome Android18+WebView Android4.4+Samsung Internet1.0+Opera Android10.1+ element . innerText [ = value ]\nReturns the element's text content \"as rendered\".\nCan be set, to replace the element's children with the given value, but with line breaks converted to br elements.\nOn getting, the innerText attribute must follow these steps:\nIf this element is not being rendered, or if the user agent is a non-CSS user agent, then return this element's descendant text content.\nThis step can produce surprising results, as when the innerText attribute is accessed on an element not being rendered, its text contents are returned, but when accessed on an element that is being rendered, all of its children that are not being rendered have their text contents ignored.\nLet results be a new empty list.\nFor each child node node of this element:\nLet current be the list resulting in running the inner text collection steps with node. 
Each item in results will either be a string or a positive integer (a required line break count).\nIntuitively, a required line break count item means that a certain number of line breaks appear at that point, but they can be collapsed with the line breaks induced by adjacent required line break count items, reminiscent to CSS margin-collapsing.\nFor each item item in current, append item to results.\nRemove any items from results that are the empty string.\nRemove any runs of consecutive required line break count items at the start or end of results.\nReplace each remaining run of consecutive required line break count items with a string consisting of as many U+000A LINE FEED (LF) characters as the maximum of the values in the required line break count items.\nReturn the concatenation of the string items in results.
Output of JSON rendered into HTML pre-formatted text
3.2.7 The innerText IDL attribute
Support: innertextChrome for Android 80+Chrome 4+iOS Safari 4.0+Firefox 45+Safari 3.2+Samsung Internet 4+Edge 12+IE 6+UC Browser for Android 12.12+Opera 9.5+Opera Mini all+Android Browser 2.3+
Source: caniuse.com ✔MDN
HTMLElement/innerText
Support in all current engines.Firefox45+Safari3+Chrome1+Opera9.6+Edge79+Edge (Legacy)12+Internet Explorer5.5+Firefox Android45+Safari iOS4+Chrome Android18+WebView Android4.4+Samsung Internet1.0+Opera Android10.1+ element . innerText [ = value ]
Returns the element's text content \"as rendered\".
Can be set, to replace the element's children with the given value, but with line breaks converted to br elements.
On getting, the innerText attribute must follow these steps:
If this element is not being rendered, or if the user agent is a non-CSS user agent, then return this element's descendant text content.
This step can produce surprising results, as when the innerText attribute is accessed on an element not being rendered, its text contents are returned, but when accessed on an element that is being rendered, all of its children that are not being rendered have their text contents ignored.
Let results be a new empty list.
For each child node node of this element:
Let current be the list resulting in running the inner text collection steps with node. Each item in results will either be a string or a positive integer (a required line break count).
Intuitively, a required line break count item means that a certain number of line breaks appear at that point, but they can be collapsed with the line breaks induced by adjacent required line break count items, reminiscent to CSS margin-collapsing.
For each item item in current, append item to results.
Remove any items from results that are the empty string.
Remove any runs of consecutive required line break count items at the start or end of results.
Replace each remaining run of consecutive required line break count items with a string consisting of as many U+000A LINE FEED (LF) characters as the maximum of the values in the required line break count items.
Return the concatenation of the string items in results.
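
The closing steps of the innerText getter quoted in the HTML input above (drop empty strings, trim line break counts at both ends, turn each remaining run of counts into as many line feeds as the largest count) can be sketched in a few lines. This assumes the inner text collection steps have already produced the list; join_inner_text_items is a made-up name:

```python
def join_inner_text_items(results):
    """Combine strings and 'required line break count' integers into text."""
    # Remove any items that are the empty string.
    items = [item for item in results if item != ""]

    # Remove runs of required line break counts at the start and end.
    while items and isinstance(items[0], int):
        items.pop(0)
    while items and isinstance(items[-1], int):
        items.pop()

    # Replace each remaining run of counts with max(count) line feeds.
    output = []
    run_max = 0
    for item in items:
        if isinstance(item, int):
            run_max = max(run_max, item)
        else:
            if run_max:
                output.append("\n" * run_max)
                run_max = 0
            output.append(item)
    return "".join(output)


# Items for <p>Wow<br>This</p><p>Is Interesting</p>: each <p> contributes a
# count of 2 before and after its content, and <br> contributes "\n".
items = [2, "Wow", "\n", "This", 2, 2, "Is Interesting", 2]
print(repr(join_inner_text_items(items)))  # prints 'Wow\nThis\n\nIs Interesting'
```

The two adjacent counts of 2 between the paragraphs collapse into a single blank line, which is the margin-collapsing behaviour the spec note describes.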
What Lynx text-mode browser displays

Screenshot from 2020-04-25 13-43-57

My point being that if e-content or p-content were rendered like this, it would be a fairly poor output. Text could be made an optional field, or annotated with some points.

  1. If text conversion is not possible, make the text field instructions for how to get the content, or allow it to be omitted.
  2. If text conversion is possible, the algorithm for generating it may not be perfect, or representative of the original content; therefore it makes sense to store the u-url within the h-* (or, in the absence of such a property, the URL the h-* was retrieved from) so the text can be regenerated when the algorithm improves.

Further: By suggesting that generation of text (non-source content) be proprietary and merely representative, the spec might more clearly communicate the intent.

I might for example decide that a markdown format is suitable for an audience; or CURSES for text-mode CLI or teletype-compatible targets.

I might for example manually insert ordered list indexes as content with spacing for mobile or plain-text email, and use an asterisk or similar 8-bit compatible character for unordered lists.

Where changing the font is impossible, it may be possible to add newlines, underscores or textual decoration to differentiate content.

8 participants