Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract and recover structure of extras #48

Merged
merged 13 commits into from
Sep 18, 2024
Merged

Extract and recover structure of extras #48

merged 13 commits into from
Sep 18, 2024

Conversation

mjambon
Copy link
Member

@mjambon mjambon commented Sep 15, 2022

Update (2024): we're seeing strange parsing errors in semgrep with the html parser. It seems unrelated to the work on extras but it's blocking us. The semgrep PR is https://github.com/semgrep/semgrep-proprietary/pull/2260

Closes #2

Extras are nodes that can occur anywhere in the CST returned by tree-sitter. They used to be completely ignored. Usually, they are not needed because they're comments. However, the heredoc syntax in Ruby and Bash calls for parsing the body of the template as an extra. The syntax is roughly "the token << indicates that the contents starting on the next line is the body of the heredoc template". The main issue is that some material can follow the marker << on the same line. There can even be multiple << on the same line. In Bash, for example, we can do this:

Simple case:

$ cat <<A > output_a.txt
This goes to output_a.txt.
A

Extreme case:

$ cat <<A > output_a.txt; cat <<B > output_b.txt
This goes to output_a.txt.
A
This goes to output_b.txt.
B

This results in the insertion of the heredoc body where it occurs, i.e. not right after the marker << but later, at a spot that can't be specified by a pure tree-sitter grammar.

This PR allows the extraction and structure recovery of extra nodes independently from the grammar. When translating the CST to a more useful tree, the programmer will need to write code that attempts to match the opening heredoc marker with the next heredoc body that makes sense.

The following choices were made:

  • all the extras are grouped under a single sum type because it's simpler to implement e.g.
type extra = Comment of comment | Heredoc of heredoc
type extras = extra list

instead of

type extras = { comments : comment list; heredocs : heredoc list }

I think the latter would require either more code generation (generate code to convert the variant list to the record of lists) or would be done less efficiently (multiple passes over the whole tree).

  • the root of each extra comes with a location. This is convenient to for pairing a heredoc opening with the correct heredoc body. In fact, having a location for each node is desirable but would require changing a lot of boilerplate in semgrep, so we might want to do this another day.

Here's what we have to the test extras:

module.exports = grammar({
  name: "extras",
  rules: {
    program: $ => repeat(
      $.number
    ),
    number: $ => /[0-9]+/,
    letter: $ => /[a-zA-Z]/,
    comment: $ => /#.*/,
    complex_extra: $ => seq('(', repeat(choice($.number, $.letter)), ')'),
  },
  extras: $ => [
    $.comment,       // appears in the CST
    $.complex_extra, // same
    "IGNOREME",  // doesn't appear in the CST because it doesn't have a name
    /\s|\\\n/    // same
  ]
})

Test input:

~/ocaml-tree-sitter-core/test/extras $ cat test/ok/complex_extra 
# first comment
1 2 (3 A # extra in the middle of another extra!
 4) 5

Test output (original tree-sitter output followed by the recovered typed representation):

$ ./parse test/ok/complex_extra 
CST obtained from the tree-sitter parser:
program:
| comment
| number
| number
| complex_extra:
|   "("
|   number
|   letter
|   comment
|   number
|   ")"
| number
---
Recovered typed CST:
| 1
| 2
| 5
Extras:
|   Comment
|   "0:0-0:15"
|   "# first comment"
|   Complex_extra
|   "1:4-2:3"
|   | "("
|   |   | Num
|   |   | 3
|   |   | Letter
|   |   | A
|   |   | Num
|   |   | 4
|   | ")"
|   Comment
|   "1:9-1:48"
|   "# extra in the middle of another extra!"
total lines: 4
error lines: 0
error count: 0
success: 100.00%

test plan: make && make install && make test will catch most of the problems that are caused by bad code generation.

Security

  • Change has no security implications (otherwise, ping the security team)

@mjambon mjambon marked this pull request as draft September 15, 2022 06:17
@mjambon mjambon requested a review from nmote September 15, 2022 23:48
@mjambon mjambon marked this pull request as ready for review September 15, 2022 23:48
@mjambon mjambon marked this pull request as draft September 15, 2022 23:49
@mjambon mjambon requested a review from a team September 15, 2022 23:49
@aryx
Copy link
Contributor

aryx commented Sep 18, 2022

Why does the Ruby and Bash grammars use extra to represent the body part of an heredoc?
How the other languages handle heredocs? PHP has heredocs and we don't have problems there? Or do we?

@mjambon
Copy link
Member Author

mjambon commented Sep 22, 2022

Why does the Ruby and Bash grammars use extra to represent the body part of an heredoc?

It thought it would be clear from the example. A heredoc body starts on the line after the line that contains the heredoc opening marker. It does not necessarily start right after the marker, which can be followed by other instructions or fragments of the current instruction as a long as we stay on the same line.

The Bash grammar actually doesn't do yet this but it should.

How the other languages handle heredocs? PHP has heredocs and we don't have problems there? Or do we?

According to https://www.php.net/manual/en/language.types.string.php#language.types.string.syntax.heredoc, PHP heredocs are sane, i.e. a newline is required right after the opening marker.

@mjambon
Copy link
Member Author

mjambon commented Sep 10, 2024

I updated the PR so it works again. I had to update how we generate dumpers for extras. I'm now looking into regenerating all the grammars in ocaml-tree-sitter-semgrep.

@mjambon mjambon marked this pull request as ready for review September 17, 2024 02:24
@mjambon mjambon requested a review from aryx September 18, 2024 03:01
@mjambon mjambon merged commit e063ec5 into main Sep 18, 2024
2 checks passed
@mjambon mjambon deleted the mj-parse-extras branch September 18, 2024 16:38
semgrep-ci bot pushed a commit to semgrep/semgrep that referenced this pull request Sep 18, 2024
…oprietary#2260)

This changes the type of tree-sitter parsing results. This required
regenerating all the grammars. Passing CI tests with this pull requests
validates the other two pull requests:
* semgrep/ocaml-tree-sitter-core#48
* semgrep/ocaml-tree-sitter-semgrep#510

test plan: `make test`

synced from Pro bc3fa2e1e2838941e3860d422569ca6a35826b15
semgrep-ci bot pushed a commit to semgrep/semgrep that referenced this pull request Sep 19, 2024
…oprietary#2260)

This changes the type of tree-sitter parsing results. This required
regenerating all the grammars. Passing CI tests with this pull requests
validates the other two pull requests:
* semgrep/ocaml-tree-sitter-core#48
* semgrep/ocaml-tree-sitter-semgrep#510

test plan: `make test`

synced from Pro bc3fa2e1e2838941e3860d422569ca6a35826b15
nmote pushed a commit to semgrep/semgrep that referenced this pull request Sep 19, 2024
…oprietary#2260)

This changes the type of tree-sitter parsing results. This required
regenerating all the grammars. Passing CI tests with this pull requests
validates the other two pull requests:
* semgrep/ocaml-tree-sitter-core#48
* semgrep/ocaml-tree-sitter-semgrep#510

test plan: `make test`

synced from Pro bc3fa2e1e2838941e3860d422569ca6a35826b15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Return list of "extras", nodes like comments that can occur anywhere
2 participants