parser: remove Regexes from whitespace parser #1008

zsol · 2023-09-02T10:40:43Z

parser: remove Regexes from whitespace parser

Removing regexes from the whitespace parser lets us drop the thread-local storage and the lazy initialization cost that came with them.

This shows a modest 2% improvement in overall parse time (inflate improves by 10%).
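As a rough illustration of what replacing a regex with a hand-written scan can look like, here is a minimal sketch. It assumes "simple whitespace" means space, tab, and form feed (the character class quoted later in this thread); the function names are hypothetical and are not taken from the libcst source.

```rust
/// Returns the byte length of the leading run of simple whitespace
/// (space, tab, form feed) in `line`.
/// Hypothetical helper for illustration; not libcst's actual API.
fn simple_whitespace_len(line: &str) -> usize {
    line.as_bytes()
        .iter()
        .take_while(|&&b| matches!(b, b' ' | b'\t' | b'\x0c'))
        .count()
}

/// Splits `line` into its leading simple whitespace and the remainder.
fn split_leading_whitespace(line: &str) -> (&str, &str) {
    // Every matched byte is ASCII, so the index is a valid char boundary.
    line.split_at(simple_whitespace_len(line))
}
```

A plain function like this has no compiled state to cache, so nothing needs to live in thread-local storage or be initialized lazily.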

zsol · 2023-09-02T10:43:00Z

cc @orf @MichaReiser as recent relevant contributors

codecov-commenter · 2023-09-02T10:47:08Z

Codecov Report

Patch and project coverage have no change.

Comparison is base (9c263aa) 91.04% compared to head (f7175d3) 91.04%.
Report is 1 commit behind head on main.


Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1008   +/-   ##
=======================================
  Coverage   91.04%   91.04%           
=======================================
  Files         255      255           
  Lines       26366    26366           
=======================================
  Hits        24004    24004           
  Misses       2362     2362


orf · 2023-09-02T11:56:36Z

native/libcst/src/tokenizer/whitespace_parser.rs

-            newline_str.chars().count(),
-            newline_str.len(),
-        )?;
+    let len = match newline_after.as_bytes() {


I was interested in the performance characteristics of this vs other ways of checking a string prefix. With this benchmark script,

match &thing.get(..2) {
    Some("\n") => 1,
    Some("\r\n") => 2,
    Some("\r") => 1,
    _ => 0,
}

is a bit faster in all cases: 680 picoseconds versus 914 picoseconds for the as_bytes() match in the \r\n case. Not that it makes much difference, but it is interesting because I would have assumed they optimize to the same code.

Ahh, that's not exactly equivalent: it would return None for single-character strings. Ignore me.
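To make the difference concrete, here is a small sketch of both approaches. The slice-pattern version is an assumption about the general shape of the `match newline_after.as_bytes()` in the diff, not a copy of it, and both function names are hypothetical.

```rust
/// Byte length of the newline sequence at the start of `s`, or 0 if none.
/// Slice patterns only look at the bytes that are present, so short
/// inputs such as "\n" are handled correctly.
fn newline_len(s: &str) -> usize {
    match s.as_bytes() {
        [b'\r', b'\n', ..] => 2,
        [b'\n', ..] | [b'\r', ..] => 1,
        _ => 0,
    }
}

/// The `get(..2)` variant from the comment above, kept for comparison.
/// `get(..2)` yields either `None` (input shorter than two bytes, or the
/// cut is not on a char boundary) or a slice of exactly two bytes, so the
/// one-character arm can never match.
fn newline_len_via_get(s: &str) -> usize {
    match s.get(..2) {
        Some("\r\n") => 2,
        Some("\n") | Some("\r") => 1, // never taken: a 2-byte slice is not a 1-byte string
        _ => 0,
    }
}
```

With these definitions, `newline_len("\n")` is 1 while `newline_len_via_get("\n")` is 0, so the two are not interchangeable and the benchmark numbers compare different work.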

orf · 2023-09-02T13:32:46Z

This looks good, but I was just thinking about something. It might be a dumb idea, but wouldn't it be more efficient to work out the byte indexes of all whitespace in one pass and reuse them?

let regex = regex::bytes::Regex::new(r#"(([ \t\x0c]|(\r\n?|\n))+)"#).unwrap();

let whitespace_indexes: Vec<Range<usize>> = regex.find_iter(bytes).map(|r| r.range()).collect();

This gives us a vec of the longest whitespace runs in the input in a single pass, which has got to be more efficient than continually advancing step by step?

There are better data structures than a vec, but if we have this and a byte offset, we know whether we're in a range of whitespace and where that whitespace range ends. I dunno, something like:

// ws_by_line: whitespace ranges precomputed per line (hypothetical field)
let line_ws = &config.ws_by_line[state.line];
for range in line_ws {
    if range.contains(&state.byte_offset) {
        let ws = &config.input[state.byte_offset..range.end];
        return Ok(SimpleWhitespace(ws));
    }
}
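For readers who want to try the precomputed-ranges idea, here is a self-contained sketch. It swaps the `regex` crate for a hand-written single pass over the same character set so that it needs only the standard library, and it stores all runs in one sorted vec rather than per line. The names `whitespace_runs` and `run_at` are hypothetical.

```rust
use std::ops::Range;

/// Collects the byte ranges of maximal whitespace runs in one pass.
/// Hand-written stand-in for the `find_iter` call suggested above;
/// it treats space, tab, form feed, CR and LF as whitespace.
fn whitespace_runs(input: &[u8]) -> Vec<Range<usize>> {
    let is_ws = |b: u8| matches!(b, b' ' | b'\t' | b'\x0c' | b'\r' | b'\n');
    let mut runs = Vec::new();
    let mut i = 0;
    while i < input.len() {
        if is_ws(input[i]) {
            let start = i;
            while i < input.len() && is_ws(input[i]) {
                i += 1;
            }
            runs.push(start..i);
        } else {
            i += 1;
        }
    }
    runs
}

/// Finds the run containing `offset`, if any, by binary search.
/// Runs are sorted and disjoint, so the first run whose end lies past
/// `offset` is the only candidate.
fn run_at(runs: &[Range<usize>], offset: usize) -> Option<&Range<usize>> {
    let idx = runs.partition_point(|r| r.end <= offset);
    runs.get(idx).filter(|r| r.contains(&offset))
}
```

Given a parser position `offset`, `run_at(&runs, offset)` returns the enclosing run, and `&input[offset..run.end]` would be the whitespace slice. Whether the up-front pass and allocation pay for themselves depends on how often the lookup is called.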

MichaReiser · 2023-09-02T18:20:11Z

It depends on how frequently this is called, but searching a string is probably more efficient than running a regex, allocating and writing all the positions, and then searching for it (even if it is a binary search). I'll review the code changes tomorrow or on Monday


MichaReiser · 2023-09-04T06:37:35Z

native/libcst/src/tokenizer/whitespace_parser.rs

+        let line = config.get_line_after_column(line, col)?;
+        let bytes = line.as_bytes();
+        let mut idx = 0;
+        while idx < bytes.len() {


An abstraction that worked really well for Ruff (and Rust) is to use the following Cursor abstraction to lex strings. It avoids the need for these manual match statements.

https://github.com/astral-sh/ruff/blob/4d49d5e8451277f8159a30b7da187626d3a75494/crates/ruff_python_parser/src/lexer/cursor.rs#L14-L13
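To give a feel for the cursor style being suggested, here is a minimal sketch. It is a hypothetical, stripped-down cursor written for this illustration and is not Ruff's implementation; the method names and the `eat_trailing_whitespace` helper are assumptions.

```rust
use std::str::Chars;

/// A minimal cursor over a string slice (illustrative only).
struct Cursor<'a> {
    chars: Chars<'a>,
    source_len: usize,
}

impl<'a> Cursor<'a> {
    fn new(source: &'a str) -> Self {
        Self { chars: source.chars(), source_len: source.len() }
    }

    /// Peeks at the next character without consuming it.
    fn first(&self) -> Option<char> {
        self.chars.clone().next()
    }

    /// Number of bytes consumed so far.
    fn offset(&self) -> usize {
        self.source_len - self.chars.as_str().len()
    }

    /// Consumes the next character if it equals `c`.
    fn eat_char(&mut self, c: char) -> bool {
        if self.first() == Some(c) {
            self.chars.next();
            true
        } else {
            false
        }
    }

    /// Consumes characters for as long as `pred` holds.
    fn eat_while(&mut self, mut pred: impl FnMut(char) -> bool) {
        while let Some(c) = self.first() {
            if !pred(c) {
                break;
            }
            self.chars.next();
        }
    }
}

/// Consumes simple whitespace followed by an optional newline and
/// returns the number of bytes consumed.
fn eat_trailing_whitespace(line: &str) -> usize {
    let mut cursor = Cursor::new(line);
    cursor.eat_while(|c| matches!(c, ' ' | '\t' | '\x0c'));
    if cursor.eat_char('\r') {
        cursor.eat_char('\n'); // "\r\n" counts as a single newline
    } else {
        cursor.eat_char('\n');
    }
    cursor.offset()
}
```

The parsing function reads as a sequence of `eat_*` calls, with the index bookkeeping kept inside the cursor instead of a manual `while idx < bytes.len()` loop.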


facebook-github-bot added the CLA Signed label Sep 2, 2023

zsol marked this pull request as ready for review September 2, 2023 10:42

zsol force-pushed the pr1008 branch from 25fb7d2 to e5a7190 Compare September 2, 2023 11:19

orf reviewed Sep 2, 2023


MichaReiser approved these changes Sep 4, 2023


zsol force-pushed the pr1008 branch from e5a7190 to 499aeef Compare September 9, 2023 10:29

parser: remove Regexes from whitespace parser

f7175d3


zsol force-pushed the pr1008 branch from 499aeef to f7175d3 Compare September 9, 2023 10:34

zsol merged commit 94dd20e into main Sep 9, 2023
24 of 25 checks passed

zsol deleted the pr1008 branch September 9, 2023 16:03
