Extract and recover structure of extras #48

mjambon · 2022-09-15T06:17:45Z

Update (2024): we're seeing strange parsing errors in semgrep with the html parser. It seems unrelated to the work on extras but it's blocking us. The semgrep PR is https://github.com/semgrep/semgrep-proprietary/pull/2260

Closes #2

Extras are nodes that can occur anywhere in the CST returned by tree-sitter. They used to be completely ignored. Usually, they are not needed because they're comments. However, the heredoc syntax in Ruby and Bash calls for parsing the body of the template as an extra. The syntax is roughly "the token << indicates that the contents starting on the next line is the body of the heredoc template". The main issue is that some material can follow the marker << on the same line. There can even be multiple << on the same line. In Bash, for example, we can do this:

Simple case:

$ cat <<A > output_a.txt
This goes to output_a.txt.
A

Extreme case:

$ cat <<A > output_a.txt; cat <<B > output_b.txt
This goes to output_a.txt.
A
This goes to output_b.txt.
B

This results in the insertion of the heredoc body where it occurs, i.e. not right after the marker << but later, at a spot that can't be specified by a pure tree-sitter grammar.

This PR allows the extraction and structure recovery of extra nodes independently from the grammar. When translating the CST to a more useful tree, the programmer will need to write code that attempts to match the opening heredoc marker with the next heredoc body that makes sense.

The following choices were made:

all the extras are grouped under a single sum type because it's simpler to implement e.g.

type extra = Comment of comment | Heredoc of heredoc
type extras = extra list

instead of

type extras = { comments : comment list; heredocs : heredoc list }

I think the latter would require either more code generation (generate code to convert the variant list to the record of lists) or would be done less efficiently (multiple passes over the whole tree).

the root of each extra comes with a location. This is convenient to for pairing a heredoc opening with the correct heredoc body. In fact, having a location for each node is desirable but would require changing a lot of boilerplate in semgrep, so we might want to do this another day.

Here's what we have to the test extras:

module.exports = grammar({
  name: "extras",
  rules: {
    program: $ => repeat(
      $.number
    ),
    number: $ => /[0-9]+/,
    letter: $ => /[a-zA-Z]/,
    comment: $ => /#.*/,
    complex_extra: $ => seq('(', repeat(choice($.number, $.letter)), ')'),
  },
  extras: $ => [
    $.comment,       // appears in the CST
    $.complex_extra, // same
    "IGNOREME",  // doesn't appear in the CST because it doesn't have a name
    /\s|\\\n/    // same
  ]
})

Test input:

~/ocaml-tree-sitter-core/test/extras $ cat test/ok/complex_extra 
# first comment
1 2 (3 A # extra in the middle of another extra!
 4) 5

Test output (original tree-sitter output followed by the recovered typed representation):

$ ./parse test/ok/complex_extra 
CST obtained from the tree-sitter parser:
program:
| comment
| number
| number
| complex_extra:
|   "("
|   number
|   letter
|   comment
|   number
|   ")"
| number
---
Recovered typed CST:
| 1
| 2
| 5
Extras:
|   Comment
|   "0:0-0:15"
|   "# first comment"
|   Complex_extra
|   "1:4-2:3"
|   | "("
|   |   | Num
|   |   | 3
|   |   | Letter
|   |   | A
|   |   | Num
|   |   | 4
|   | ")"
|   Comment
|   "1:9-1:48"
|   "# extra in the middle of another extra!"
total lines: 4
error lines: 0
error count: 0
success: 100.00%

test plan: make && make install && make test will catch most of the problems that are caused by bad code generation.

Security

Change has no security implications (otherwise, ping the security team)

aryx · 2022-09-18T12:05:42Z

Why does the Ruby and Bash grammars use extra to represent the body part of an heredoc?
How the other languages handle heredocs? PHP has heredocs and we don't have problems there? Or do we?

mjambon · 2022-09-22T17:58:30Z

Why does the Ruby and Bash grammars use extra to represent the body part of an heredoc?

It thought it would be clear from the example. A heredoc body starts on the line after the line that contains the heredoc opening marker. It does not necessarily start right after the marker, which can be followed by other instructions or fragments of the current instruction as a long as we stay on the same line.

The Bash grammar actually doesn't do yet this but it should.

How the other languages handle heredocs? PHP has heredocs and we don't have problems there? Or do we?

According to https://www.php.net/manual/en/language.types.string.php#language.types.string.syntax.heredoc, PHP heredocs are sane, i.e. a newline is required right after the opening marker.

mjambon · 2024-09-10T21:38:18Z

I updated the PR so it works again. I had to update how we generate dumpers for extras. I'm now looking into regenerating all the grammars in ocaml-tree-sitter-semgrep.

embedded within other extras.

We could add locations to all nodes of the CST but that would require massive (but easy) changes to the legacy code in semgrep.

with the CST as well as allowing tricks such as merging the TypeScript and TSX CST types.

…oprietary#2260) This changes the type of tree-sitter parsing results. This required regenerating all the grammars. Passing CI tests with this pull requests validates the other two pull requests: * semgrep/ocaml-tree-sitter-core#48 * semgrep/ocaml-tree-sitter-semgrep#510 test plan: `make test` synced from Pro bc3fa2e1e2838941e3860d422569ca6a35826b15

mjambon marked this pull request as draft September 15, 2022 06:17

mjambon requested a review from nmote September 15, 2022 23:48

mjambon marked this pull request as ready for review September 15, 2022 23:48

mjambon marked this pull request as draft September 15, 2022 23:49

mjambon requested a review from a team September 15, 2022 23:49

mjambon mentioned this pull request Sep 16, 2022

Can't suppress with // nosemgrep because of autoformatter semgrep/semgrep#3521

Closed

mjambon mentioned this pull request Sep 22, 2022

Matching comments is not supported semgrep/semgrep#5931

Open

mjambon mentioned this pull request Nov 7, 2022

Fix 'tree-sitter generate' memory explosion on Hack grammar semgrep/ocaml-tree-sitter-semgrep#365

Open

akuhlens removed the request for review from a team July 25, 2024 17:03

mjambon force-pushed the mj-parse-extras branch from 50f910e to 837731c Compare September 10, 2024 21:36

mjambon mentioned this pull request Sep 11, 2024

Add scripts and instructions for switching between tree-sitter versions #81

Merged

1 task

mjambon and others added 9 commits September 12, 2024 14:09

Collect and translate named extras

744fb68

Test and fix the extraction of extras, and in particular of extras

0a67696

embedded within other extras.

Add locations to the root of extra subtrees.

eb54062

We could add locations to all nodes of the CST but that would require massive (but easy) changes to the legacy code in semgrep.

Update test expectations

fc00add

Add a test (which currently fails)

2bc89d7

Fix: Translators for rules appearing only in extras were removed.

28f1ca7

Port extras to the new dumping mechanism

47a7daa

Add a naming test

dc60e03

Fix extra dumpers

45cb3e5

mjambon force-pushed the mj-parse-extras branch from 837731c to 45cb3e5 Compare September 12, 2024 21:10

mjambon added 3 commits September 12, 2024 17:21

Use polymorphic variants to represent extras. This is for consistency

7ef16ec

with the CST as well as allowing tricks such as merging the TypeScript and TSX CST types.

Populate tree-sitter-version file without an extra step

b3c7e8c

Fix 'cd: downloads: No such file or directory'

06d34da

mjambon marked this pull request as ready for review September 17, 2024 02:24

Make script more robust (?)

5e784fa

mjambon mentioned this pull request Sep 18, 2024

Update the ocaml-tree-sitter runtime library semgrep/ocaml-tree-sitter-semgrep#510

Merged

2 tasks

mjambon requested a review from aryx September 18, 2024 03:01

aryx approved these changes Sep 18, 2024

View reviewed changes

mjambon merged commit e063ec5 into main Sep 18, 2024
2 checks passed

mjambon deleted the mj-parse-extras branch September 18, 2024 16:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extract and recover structure of extras #48

Extract and recover structure of extras #48

mjambon commented Sep 15, 2022 •

edited

Loading

aryx commented Sep 18, 2022

mjambon commented Sep 22, 2022 •

edited

Loading

mjambon commented Sep 10, 2024

Extract and recover structure of extras #48

Extract and recover structure of extras #48

Conversation

mjambon commented Sep 15, 2022 • edited Loading

Security

aryx commented Sep 18, 2022

mjambon commented Sep 22, 2022 • edited Loading

mjambon commented Sep 10, 2024

mjambon commented Sep 15, 2022 •

edited

Loading

mjambon commented Sep 22, 2022 •

edited

Loading