Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parser: remove Regexes from whitespace parser #1008

Merged
merged 1 commit into from
Sep 9, 2023
Merged

parser: remove Regexes from whitespace parser #1008

merged 1 commit into from
Sep 9, 2023

Conversation

zsol
Copy link
Member

@zsol zsol commented Sep 2, 2023

parser: remove Regexes from whitespace parser

removing Regexes from whitespace parser allows ditching of thread local storage + lazy initialization cost

This shows a modest 2% improvement in overall parse time (inflate is improved by 10%)

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 2, 2023
@zsol zsol marked this pull request as ready for review September 2, 2023 10:42
@zsol
Copy link
Member Author

zsol commented Sep 2, 2023

cc @orf @MichaReiser as recent relevant contributors

@codecov-commenter
Copy link

codecov-commenter commented Sep 2, 2023

Codecov Report

Patch and project coverage have no change.

Comparison is base (9c263aa) 91.04% compared to head (f7175d3) 91.04%.
Report is 1 commits behind head on main.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1008   +/-   ##
=======================================
  Coverage   91.04%   91.04%           
=======================================
  Files         255      255           
  Lines       26366    26366           
=======================================
  Hits        24004    24004           
  Misses       2362     2362           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

newline_str.chars().count(),
newline_str.len(),
)?;
let len = match newline_after.as_bytes() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was interested in the performance characteristics of this vs other ways of checking a string prefix. With this benchmark script,

match &thing.get(..2) {
        Some("\n") => 1,
        Some("\r\n") => 2,
        Some("\r") => 1,
        _ => 0,
    }

is a bit faster in all cases: 914 picoseconds vs 680 picoseconds for the \r\n case. Not that it makes much difference, but it is interesting because I would assume they optimize to the same code.

Copy link
Contributor

@orf orf Sep 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, that's not exactly equivalent, it would return None on single digit characters. Ignore me.

@orf
Copy link
Contributor

orf commented Sep 2, 2023

This looks good, but I was just thinking about something. It might be a dumb idea, but wouldn't it be more efficient to work out the byte indexes of whitespaces in one pass and re-use that?

let regex = regex::bytes::Regex::new(r#"(([ \t\x0c]|(\r\n?|\n))+)"#,).unwrap();

let whitespace_indexes: Vec<Range<usize>> = regex.find_iter(bytes).map(|r| r.range()).collect();

This gives us a vec of the longest whitespace runs in the input in a single pass, which has got to be more efficient than continually advancing step by step?

There are better datastructures than a vec but if we have this and a byte offset we know if we're in a range of whitespace and what the end position for that whitespace range is? I dunno, something like:

let line_ws = config.ws_by_line[state.line];
for range in line_ws {
    if range.contains(state.byte_offset) {
       let ws = &config.input[state.byte_offset..range.end]
       return Ok(SimpleWhitespace(ws))
   }
}

@MichaReiser
Copy link
Contributor

It depends on how frequently this is called, but searching a string is probably more efficient than running a regex, allocating and writing all the positions, and then searching for it (even if it is a binary search). I'll review the code changes tomorrow or on Monday

native/libcst/src/tokenizer/whitespace_parser.rs Outdated Show resolved Hide resolved
let line = config.get_line_after_column(line, col)?;
let bytes = line.as_bytes();
let mut idx = 0;
while idx < bytes.len() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An abstraction that worked really well for Ruff (and Rust) is to use the following Cursor abstraction to lex strings. It avoids the need for these manual match statements.

https://github.com/astral-sh/ruff/blob/4d49d5e8451277f8159a30b7da187626d3a75494/crates/ruff_python_parser/src/lexer/cursor.rs#L14-L13

removing Regexes from whitespace parser allows ditching of thread local storage + lazy initialization cost

This shows a modest 2% improvement in overall parse time (inflate is improved by 10%)
@zsol zsol merged commit 94dd20e into main Sep 9, 2023
24 of 25 checks passed
@zsol zsol deleted the pr1008 branch September 9, 2023 16:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants