
Maintain a list of tokens #133

Closed
vosen opened this issue May 14, 2015 · 3 comments


vosen commented May 14, 2015

Most of our functionality that relies on token output just lexes whatever text it needs ad hoc. It'd be much better to maintain a list of tokens for every span. I've got some Rust lexing code lying around, but it'll need some improvements and wrapping (handling newlines as separate tokens, incremental mode, some kind of cache for "give me a list of tokens that intersect this span" queries, etc.).
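To make the "tokens that intersect this span" query concrete, here is a minimal, hypothetical sketch (the project itself is C#; `Token` and `TokenCache` are illustrative names, not the actual code). Tokens are kept sorted by start offset, so the query is a binary search for the first candidate plus a linear walk:

```python
import bisect
from typing import NamedTuple

class Token(NamedTuple):
    start: int   # absolute character offset
    length: int
    kind: str    # e.g. "kw", "ident", "comment"

class TokenCache:
    def __init__(self, tokens):
        # Tokens are assumed non-overlapping; sort them by start offset.
        self.tokens = sorted(tokens)
        self.starts = [t.start for t in self.tokens]

    def intersecting(self, span_start, span_end):
        """Return all tokens overlapping the half-open span [span_start, span_end)."""
        # First candidate: the token whose start is at or before span_start...
        i = bisect.bisect_right(self.starts, span_start) - 1
        # ...unless it ends before the span begins.
        if i >= 0 and self.tokens[i].start + self.tokens[i].length <= span_start:
            i += 1
        i = max(i, 0)
        out = []
        while i < len(self.tokens) and self.tokens[i].start < span_end:
            out.append(self.tokens[i])
            i += 1
        return out

cache = TokenCache([Token(0, 3, "kw"), Token(4, 5, "ident"), Token(10, 8, "comment")])
print([t.kind for t in cache.intersecting(5, 12)])   # ['ident', 'comment']
```

The same shape works for the incremental case; only the backing store (here a flat list) changes.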

@vosen vosen self-assigned this May 14, 2015

vosen commented Oct 18, 2015

This issue turns out to be worse than I initially thought.
Currently we implement syntax highlighting through ITaggerProvider. We always assumed that our ITagger<T> would be called line-by-line on every load and change, which, while wasteful, would be correct.
It turns out that's wrong: ITagger<T> can be called at any time, with any span VS wants. With large files in particular, VS will skip tagging some lines.
In practice, this breaks highlighting of multiline tokens (/**/ comments, strings) in large documents.
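To illustrate the failure mode (a hypothetical sketch, not the extension's code): a lexer that starts fresh at whatever span the editor hands it has no way to know it is inside a multi-line token, while one that carries state from the start of the document classifies the same offset correctly.

```python
text = "let a = 1;\n/* a long\n   comment */\nlet b = 2;\n"

def naive_kind_at(text, offset):
    """Classify offset by lexing only its own line, with no carried-over state."""
    line_start = text.rfind("\n", 0, offset) + 1
    line = text[line_start:text.find("\n", line_start)]
    # A stateless per-line lexer only sees a comment if the opener is on this line.
    return "comment" if "/*" in line else "code"

def correct_kind_at(text, offset):
    """Classify offset using state accumulated from the beginning of the document."""
    opener = text.rfind("/*", 0, offset + 1)
    if opener != -1 and text.find("*/", opener) >= offset:
        return "comment"
    return "code"

middle = text.index("comment")          # an offset inside the /* ... */ body
print(naive_kind_at(text, middle))      # "code" -- wrong, the state was lost
print(correct_kind_at(text, middle))    # "comment"
```

Whenever VS asks only for the middle lines of a large file, the tagger behaves like `naive_kind_at` unless it maintains state across spans.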

So, since ITagger<T> needs rework anyway, I thought I'd do this properly and tackle this issue.

We can make do with our current lexer, but the crux of the problem is implementing some kind of incremental cache for tokens. We should store the tokens in something more sophisticated than a List<T>: I don't want to copy a list of thousands of tokens every time a character is entered.
Lately I've been spelunking through decompilations of Microsoft.VisualStudio.Text.Data.dll and its various friends, and I think I've got a handle on possible implementations.

  • GetLineNumberFromPosition(..) seems to be relatively cheap. For small documents it sums up an array of line lengths; for larger documents it does some kind of search on a binary-tree index of spans. We can rely on this and split the lexed document into lines, where each line contains an ITrackingPoint at the start of the line plus a List<(Span, TokenType)> for the tokens in that line.
  • Another solution is to simply chuck all the tokens into a binary tree, keeping each token as (ITrackingPoint, Length, TokenType).
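The first option above can be sketched in miniature. This is a hypothetical Python model, not the repository's code: VS's ITrackingPoint is replaced by plain line indices, and `lex_line` is a toy lexer that only understands /* */ comments. The point is the shape of the cache and its invalidation: after an edit, re-lex from the edited line and stop as soon as the per-line result converges with what's cached.

```python
def lex_line(line, state):
    """Toy lexer: 'code' runs and /* */ block comments that may span lines."""
    tokens, i = [], 0
    while i < len(line):
        if state == "comment":
            end = line.find("*/", i)
            if end == -1:
                tokens.append((i, len(line) - i, "comment"))
                i = len(line)
            else:
                tokens.append((i, end + 2 - i, "comment"))
                i, state = end + 2, "normal"
        else:
            start = line.find("/*", i)
            if start == -1:
                tokens.append((i, len(line) - i, "code"))
                i = len(line)
            else:
                if start > i:
                    tokens.append((i, start - i, "code"))
                i, state = start, "comment"
    return tokens, state

class LineCache:
    def __init__(self, lex_line):
        self.lex_line = lex_line
        self.lines = []          # per line: (tokens, end-of-line lexer state)

    def relex(self, lines, first_dirty):
        """Re-lex from the first edited line until the state converges."""
        state = self.lines[first_dirty - 1][1] if first_dirty else "normal"
        for i in range(first_dirty, len(lines)):
            tokens, state = self.lex_line(lines[i], state)
            entry = (tokens, state)
            old = self.lines[i] if i < len(self.lines) else None
            if i < len(self.lines):
                self.lines[i] = entry
            else:
                self.lines.append(entry)
            if entry == old:
                break            # cache matches again: later lines still valid
        del self.lines[len(lines):]

cache = LineCache(lex_line)
cache.relex(["a /* x", "y */ b"], 0)
print([[kind for _, _, kind in toks] for toks, _ in cache.lines])
# [['code', 'comment'], ['comment', 'code']]
cache.relex(["a  x", "y */ b"], 0)   # the /* opener on line 0 was deleted
print([[kind for _, _, kind in toks] for toks, _ in cache.lines])
# [['code'], ['code']]
```

The second relex shows the interesting case: deleting the opener changes line 0's end-of-line state, so invalidation has to spill into line 1 even though its text never changed.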

I've tried both solutions; the first one seems simple but is complicated in practice (mostly due to handling multi-line tokens). I've got invalidation implemented for both approaches. The code is at https://github.com/vosen/VSEditorLexingModel/ and I hope to wrap it up soon.

@vosen vosen mentioned this issue Oct 18, 2015
@briansmith

I have noticed the problem where syntax highlighting goes wrong with multi-line tokens.

The technique that I've seen used multiple times is this one:
https://github.com/smartmobili/parsing/blob/master/VisualStudioIntegration/PkgDef%20Editor/C%23/PkgDefLanguage.cs#L128

Note that I believe Visual Studio automatically does the right thing except with multi-line tokens, since it doesn't understand multi-line tokens. So, I think it is sufficient to just special-case the multi-line token case.
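The linked technique boils down to: cache each line's end-of-line lexer state, and when re-tagging a line leaves it in a different state than before, notify the editor that the following line is stale so it gets re-requested (in VS, by raising the ITagger<T>.TagsChanged event); the invalidation then cascades through the multi-line token. A hypothetical Python sketch of that control flow, with `on_tags_changed` standing in for TagsChanged and a deliberately toy `lex_line`:

```python
def lex_line(line, state):
    # Toy per-line lexer (assumes at most one /* or */ per line): returns
    # the tags for the line plus the lexer state at the end of the line.
    tag = "comment" if state == "comment" or "/*" in line else "code"
    if "/*" in line:
        state = "comment"
    if "*/" in line:
        state = "normal"
    return [tag], state

class ForwardPropagatingTagger:
    def __init__(self, lex_line, on_tags_changed):
        self.lex_line = lex_line
        self.on_tags_changed = on_tags_changed  # editor callback: re-tag this line
        self.end_states = {}                    # line number -> cached end state

    def get_tags(self, lines, line_no):
        # State entering this line is the recorded state leaving the previous one.
        state = self.end_states.get(line_no - 1, "normal")
        tags, end_state = self.lex_line(lines[line_no], state)
        if self.end_states.get(line_no, "normal") != end_state and line_no + 1 < len(lines):
            # The multi-line state leaked past this line: ask the editor to
            # re-request the next line instead of leaving it stale.
            self.on_tags_changed(line_no + 1)
        self.end_states[line_no] = end_state
        return tags

changed = []
tagger = ForwardPropagatingTagger(lex_line, changed.append)
lines = ["let a = 1;", "/* opened", "still inside", "closed */ let b;"]
# Simulate VS asking only for line 1 (where the user just typed "/*"):
tagger.get_tags(lines, 1)
print(changed)   # [2] -- the tagger requested re-tagging of the next line
```

Each re-requested line can trigger the next, so only the lines actually affected by the multi-line token get re-tagged.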

@vosen vosen mentioned this issue Nov 3, 2015

vosen commented Nov 21, 2015

Closed by #197

@vosen vosen closed this as completed Nov 21, 2015