Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Link Reference Definition] Block starts in Link Title that spans multiple lines break the definition #622

Open
movsb opened this issue Nov 1, 2019 · 6 comments

Comments

@movsb
Copy link

movsb commented Nov 1, 2019

What the spec says:

A link title consists of either

  • a sequence of zero or more characters between straight double-quote characters ("), including a " character only if it is backslash-escaped, or

  • a sequence of zero or more characters between straight single-quote characters ('), including a ' character only if it is backslash-escaped, or

  • a sequence of zero or more characters between matching parentheses ((...)), including a ( or ) character only if it is backslash-escaped.

Although link titles may span multiple lines, they may not contain a blank line.

But if block starts like #, >, * start a new line of link title. The Link will be broken.

I see that the reference CommonMark javascript implementation first parses link reference definition lines into a paragraph, then tries to parse it as a link reference definition. But block starts like #,* and > as well as blank lines break Paragraphs. So, the link reference definitions are also broken.

    paragraph: {
        continue: function(parser) {
            return (parser.blank ? 1 : 0);
        },
        finalize: function(parser, block) {
            var pos;
            var hasReferenceDefs = false;

            // try parsing the beginning as link reference definitions:
            while (peek(block._string_content, 0) === C_OPEN_BRACKET &&
                   (pos =
                    parser.inlineParser.parseReference(block._string_content,
                                                       parser.refmap))) {
                block._string_content = block._string_content.slice(pos);
                hasReferenceDefs = true;
            }
            if (hasReferenceDefs && isBlank(block._string_content)) {
                block.unlink();
            }
        },
        canContain: function() { return false; },
        acceptsLines: true
    }

Example link reference definition (dingus link):

[foo]: /url "
first line
# this should be a heading
second line
"

Just think that #, >,*, especially blank lines are common in Titles, for example, a commit message in commonmark.js:

dingus: add dependency version requirements (#159)

* dingus: add dependency version requirements

Dingus was rendering incorrectly with Bootstrap 4. Added a bower.json
which requires Bootstrap, jQuery and Lodash with major version equal
to what's currently live. Likewise the minimum patch version.

* dingus: require lodash ^4.17.11.

what GitHub rendered:

<a data-pjax="true" title="dingus: add dependency version requirements (#159)

* dingus: add dependency version requirements

Dingus was rendering incorrectly with Bootstrap 4. Added a bower.json
which requires Bootstrap, jQuery and Lodash with major version equal
to what's currently live. Likewise the minimum patch version.

* dingus: require lodash ^4.17.11." class="link-gray" href="/commonmark/commonmark.js/commit/a7eb3af149a7e22526a3eb40216504d2ecdd0c8a">dingus: add dependency version requirements (</a>

Adding backslashs seems to work. but the spec doesn't statement this.

If I try to use backslash to escape a new line, the new line preserves (not be a space), so does the backslash. (dingus link)

[foo]: /url "
first line
\# not a heading
\
second line
"

[foo]
@jgm
Copy link
Member

jgm commented Nov 2, 2019

But see https://spec.commonmark.org/0.29/#precedence
which says that indicators of block structure always take precedence over indicators of inline structure.

@jgm
Copy link
Member

jgm commented Nov 2, 2019

Related issue: #605.

I think the underlying issue here is that the reference implementations parse these as paragraphs, for parsing efficiency, but they aren't actually spec'd as paragraphs. So the spec may need some tightening up in this respect.

@movsb
Copy link
Author

movsb commented Nov 2, 2019

@jgm Thanks for replying.

that indicators of block structure always take precedence over indicators of inline structure.

In the Spec, the Link Reference Definition is under the Leaf blocks section, so they have the same precedence?

First I find it hard to use a state machine to parse definitions, so I look into how commonmark.js implements this. Now I know it treats Link Reference Definition as an only inline of a paragraph.

I think using backslashes may be a good solution as long as we carefully process the backslash with new line?

@wooorm
Copy link
Contributor

wooorm commented Nov 2, 2019

@movsb I am currently writing a state machine based spec and implementation on how to parse CM, this is what lead me to this issue as well.
The way I do definitions and paragraphs is first parse "content" (this is both link definitions and paragraphs).
When content is closed, it is first parsed for zero or more definitions, the rest is phrasing.
If there is phrasing, and the content was closed because of a setext heading underline, the phrasing becomes part of a heading. If there was no phrasing, the setext heading underline itself becomes the start of a new content.
Otherwise, if the content was closed normally, then the phrasing becomes part of a paragraph.
Furthermore, I use four state machines: one for flow constructs ("block"), one for content, one for rich text ("inline"), and one for plain text (text with escapes and character references)

@movsb
Copy link
Author

movsb commented Nov 2, 2019

@wooorm Oh, look forward to seeing that.

A question: what conditions will make a content closed? A blank line?

My state machine is only for definitions parsing. It reports a state whether blank lines can be continue lines.

for each line of lines:
    switch state:
        case label:
        case url:
        case title:
        case title_span_lines:
              return need_more_line
  • If the state machine is not yet finished, every single line is sent to it.
  • If the state machine fails, every line it consumed are reconstructed into a paragraph.

It's a WIP, don't know if it works.

BTW, Parsing blank lines is the most tricky thing in my implementation.

@wooorm
Copy link
Contributor

wooorm commented Nov 2, 2019

This conversation is a bit off topic, but thematic breaks, atx headings, setext heading underlines, fenced code blocks, HTML blocks (except type 7), blank lines, block quotes, and lists/list-items all close content. Indented code does not.

I personally don’t see blank lines as difficult, so I’m unsure what you mean calling them tricky?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants