Refactor AST codegen #4442

Closed · overlookmotel opened this issue Jul 24, 2024 · 5 comments

Labels: C-cleanup (Category - technical debt or refactoring. Solution not expected to change behavior)

@overlookmotel (Collaborator)

I've been reading through the AST codegen (most recent version as in draft PR #4404). @rzvxa and I are meeting to discuss this evening, but just to put some thoughts down on "paper" in advance...

My opinion of current state

The codegen is doing a great job of automating the creation of a ton of Oxc's code now. It makes it possible to change the AST without having to make corresponding updates in a bunch of other places (and likely introducing mistakes or omissions in the process). This is hugely valuable.

However, personally I find the implementation of the codegen itself complicated and hard to understand. The primary reasons for this are:

  1. Using syn types throughout, which are unfamiliar to those of us who are not macro maestros.
  2. Mixing those syn types with our own "metadata" types, but with similar names (e.g. TypeDef, TypeRef).
  3. Lack of code comments.
  4. (in my view) excessive abstraction in some places.

Proposed refactor

I propose that we refactor it along these lines:

Generate from schema

How it is now

Currently the way codegen works (roughly speaking, I may not have understood it correctly) is:

  • Parse .rs source files and generate REnum + RStruct types, which are mostly wrappers around syn types.
  • Generate new files mostly directly from these R* types.
  • Also separately generate a Schema object which is a simpler representation of the types.

Proposal: Simplify the pipeline

  1. Read + parse the input .rs files.
  2. Build schema including all info that generators need.
  3. Generate new files from the schema (not going back to the original syn types).
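
A minimal sketch of what this three-step flow could look like, using hypothetical Schema and Generator types (nothing here is the existing codegen API):

use std::path::PathBuf;

struct Schema {
    // All the info generators need, built once from the parsed source files.
}

struct GeneratedFile {
    path: PathBuf,
    source: String,
}

trait Generator {
    // Generators work only from the schema, never from the original syn types.
    fn generate(&self, schema: &Schema) -> GeneratedFile;
}

fn run(schema: &Schema, generators: &[Box<dyn Generator>]) -> std::io::Result<()> {
    for generator in generators {
        let file = generator.generate(schema);
        std::fs::write(&file.path, &file.source)?;
    }
    Ok(())
}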

Arguments against

There are valid arguments against what I'm suggesting:

  1. The current implementation is more efficient than what I'm proposing - it avoids e.g. generating syn Idents where there was already an Ident in the source file's AST that can be reused.
  2. Rzvxa has said that the advantage of working with the syn types is that it keeps the code flexible, so it could be converted easily to macros if we want to do that.

However, I feel like we are optimizing for the wrong things here.

  1. The codegen runs rarely, so its performance is not really important. Making it comprehensible and maintainable is much more valuable.
  2. I think we're fairly clear about what is going to be in macros and what in codegen (very little in macros, basically). So, as I see it, the current approach is buying us flexibility to do things we're pretty clear we don't intend to do, at the cost of making everything a great deal more complicated.

Arguments for

The main argument is simplicity.

I have high hopes that the codegen is going to grow and grow. We can use it to generate more code, and also pull off some more advanced tricks which would be infeasible/unsafe to do by hand. e.g.:

  • Replace the code for inherit_variants! (including calculating an optimal set of values for the enum discriminants).
  • Then generate optimized methods which use unsafe code to take advantage of patterns in these discriminants for speed.
  • Generate Serialize impls.
  • Generate more traits.
  • Calculate type layouts (current area of focus).
  • Building mechanisms for AST transfer.

As we get into more complex areas, working from a simpler model will be very valuable. Especially when generating unsafe code, it's vitally important that the codegen creating that code is easy to understand, so we can validate the soundness of its logic. (A hedged illustration of the kind of code such a generator might emit follows.)
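
To make the discriminant idea concrete, here is a hedged, self-contained illustration of the kind of code such a generator could emit. These are simplified stand-in types, not Oxc's actual AST or its generated output:

#[repr(C, u8)]
enum Expression {
    Identifier(Box<str>) = 0,
    NumericLiteral(f64) = 1,
    // Variants "inherited" from MemberExpression keep the same discriminants
    // and payload types as in MemberExpression itself.
    StaticMember(Box<str>) = 2,
    ComputedMember(Box<str>) = 3,
}

#[repr(C, u8)]
enum MemberExpression {
    StaticMember(Box<str>) = 2,
    ComputedMember(Box<str>) = 3,
}

impl Expression {
    /// Returns this expression as a `MemberExpression`, if it is one.
    fn as_member_expression(&self) -> Option<&MemberExpression> {
        match self {
            Expression::StaticMember(_) | Expression::ComputedMember(_) => {
                // SAFETY: both enums are `#[repr(C, u8)]` and the member-expression
                // variants have identical discriminants and payload types, so these
                // variants have the same layout in both types.
                Some(unsafe { &*(self as *const Expression as *const MemberExpression) })
            }
            _ => None,
        }
    }
}

The point is that a generator which assigns the discriminants itself can also emit the safety comment and the matching match arms mechanically - exactly the kind of code that is error-prone to write and maintain by hand.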

Building + using the schema

I suggest the following steps (mostly as it is now, but a little bit more structured):

  1. Compile a list of all the source .rs files codegen needs to read.
  2. Read and parse all these files.
  3. Build a Vec of (TypeId, TypeName, Item) containing all types with #[ast] attr. Throw away everything else. TypeId is the index into this Vec. Simultaneously build a HashMap<TypeName, TypeId>.
  4. Pass through that Vec again, this time fully parsing the structs + enums to create StructDef / EnumDef objects for each. "Links" between types can use TypeIds (as we can resolve type names to IDs after step 3). This is our schema.
  5. Pass the schema to each generator.
  6. Write generated files to disk.

NB: If some generators need access to the original syn types for some reason, that's not impossible - they can get them via the Vec built in step 3.
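
A rough sketch of steps 3 and 4, assuming syn 2.x and the hypothetical names used above (TypeId, StructDef, etc.); this is not the existing codegen code:

use std::collections::HashMap;

type TypeId = usize; // index into the Vec built in step 3
type TypeName = String;

// Step 3: collect all items carrying an `#[ast]` attribute and assign TypeIds.
fn collect_ast_items(
    items: Vec<syn::Item>,
) -> (Vec<(TypeName, syn::Item)>, HashMap<TypeName, TypeId>) {
    let mut collected = Vec::new();
    let mut ids = HashMap::new();
    for item in items {
        let name = match &item {
            syn::Item::Struct(s) if has_ast_attr(&s.attrs) => s.ident.to_string(),
            syn::Item::Enum(e) if has_ast_attr(&e.attrs) => e.ident.to_string(),
            _ => continue, // throw away everything else
        };
        ids.insert(name.clone(), collected.len());
        collected.push((name, item));
    }
    (collected, ids)
}

fn has_ast_attr(attrs: &[syn::Attribute]) -> bool {
    attrs.iter().any(|attr| attr.path().is_ident("ast"))
}

Step 4 would then walk the collected Vec a second time, building a StructDef or EnumDef for each entry and resolving type names to TypeIds via the map - and that becomes the schema that generators consume.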

Code style

Again, aiming for simplicity... While we are refactoring, I think we could also move to a more imperative style in parts. In my opinion, the "depth" of abstractions is a little excessive at present. If the structure were less labyrinthine, I think we also wouldn't need RefCells any more, which would further simplify the code. I would hope this will make the code easier to follow.

overlookmotel added the C-cleanup label on Jul 24, 2024
@rzvxa (Collaborator) commented Jul 24, 2024

I think going for simplicity is generally a good thing, and there are many flaws in the current implementation that we should fix as we move forward. I just want to elaborate on something. You've said:

Using syn types throughout, which are unfamiliar to those of us who are not macro maestros.

I beg to differ. I think using syn types is the standard way of doing any metaprogramming in Rust. A custom schema developed in-house, although simpler (since we don't need all the information at once), is much harder to learn, as there is no prior knowledge and no documentation or guides out there for it. On the other hand, while not everyone has used syn before, those with experience developing procedural macros (who are most likely to want to contribute to this area) should be familiar with it. Also, people who are new to Rust's procedural macros have lots of resources to study, and they can carry this knowledge with them throughout their journey as Rust developers.

Edit:

Maybe we should get rid of RTypes. We could use syn types directly and access the metadata through the codegen context: where we used to pass REnum we use syn::ItemEnum, syn::ItemStruct in place of RStruct, and simply syn::Item instead of RType. We lose some information, but we can compensate for that with a tag in their metadata.
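
A minimal sketch of what that could look like, assuming syn 2.x (the context and metadata names here are hypothetical):

use std::collections::HashMap;

struct TypeMeta {
    // The "tag" that RType used to carry, now stored on the side.
    visitable: bool,
    has_lifetime: bool,
}

struct CodegenCtx {
    // The parsed items, exactly as syn produced them.
    items: Vec<syn::Item>,
    // Extra per-type metadata, keyed by the type's name.
    meta: HashMap<String, TypeMeta>,
}

impl CodegenCtx {
    fn meta_for(&self, item: &syn::Item) -> Option<&TypeMeta> {
        let name = match item {
            syn::Item::Struct(s) => s.ident.to_string(),
            syn::Item::Enum(e) => e.ident.to_string(),
            _ => return None,
        };
        self.meta.get(&name)
    }
}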

@overlookmotel (Collaborator, Author) commented Jul 25, 2024

Oh dear! It seems we fundamentally disagree, and both feel strongly about it. I'll try and debate this a bit more:

I think using syn types is the standard way for doing any meta programming in Rust

I do see your argument here.

BUT I don't see what codegen is doing as "meta programming" in the usual sense.

"Meta programming" in Rust basically means proc macros. A proc macro:

  • can only see the type which it's attached to.
  • typically performs fairly simple AST transformations.
  • typically doesn't have to understand much about what the symbols mean, just rearrange them.

Whereas in codegen we are doing something quite different:

  1. We are building a single unified model of all the types, and their inter-relations.
  2. What codegen generates for one type depends on knowledge of other types.
  3. Code generators need to understand quite a lot about the meaning of the types and how they connect to each other.

These are relatively complex operations, which go far beyond simple AST manipulation. So I'd say it's less like "meta programming", and more like a compiler. And in a compiler, the pattern of creating an intermediate representation to work from is a common idiom. That's what I'm suggesting here (the "schema" is an IR).

Here's an analogy from another part of Oxc: In Semantic we build a scope tree. All the info in the scope tree is a duplication of information that's already in the AST, but formed into a structure which is easier to work with. When looking up the binding for an identifier reference, we could do without ScopeTree and just walk the AST, and we could find the binding that way. But obviously that would be a huge pain.

I feel that staying in "AST land" in the codegen is operating at the wrong level of abstraction, in the same way as trying to resolve bindings directly from the AST would be. Just like we're better off using the ScopeTree IR to resolve bindings, we're better off working from the Schema IR in the codegen.

Also, people who are new to Rust's procedural macros have lots of resources to study

Yes, it's true that syn's types are documented. But reading those docs is hardly a joy! There are over 100 types, and they encode all kinds of information which is totally irrelevant (like whether there is a trailing comma after the last field of a struct). I have written proc macros, so I am somewhat familiar with syn's types, but I still find it completely horrible to work with them!

It's true that our "schema" types are non-standard, but I don't think they're hard for anyone to understand:

pub struct StructDef {
    pub name: TypeName,
    pub fields: Vec<FieldDef>,
    pub has_lifetime: bool,
}

pub struct FieldDef {
    /// `None` if unnamed
    pub name: Option<String>,
    pub r#type: TypeName,
}

(compare to syn's equivalent ItemStruct and Fields)
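
For comparison, a hedged sketch (assuming syn 2.x) of pulling the same information out of syn's types, where the field names and the presence of a lifetime are buried several layers deep:

fn struct_summary(item: &syn::ItemStruct) -> (String, Vec<Option<String>>, bool) {
    let name = item.ident.to_string();
    let field_names = item
        .fields
        .iter()
        .map(|field| field.ident.as_ref().map(|ident| ident.to_string()))
        .collect();
    let has_lifetime = item.generics.lifetimes().next().is_some();
    (name, field_names, has_lifetime)
}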

The ones with experience developing procedural macros (who are most likely to want to contribute to this area) should be familiar with syn

The codegen is becoming completely fundamental to Oxc. I really don't think we should require in-depth knowledge of syn as a precondition for being able to hack on it.

@Boshen (Member) commented Jul 25, 2024

After going through the comments and the code itself, I think some kind of guidance on creating a new generator is needed; everything else is fine by me and can be iterated upon.

It's often hard to get to the final state we wish for with this kind of partially ad-hoc code. For example, the test runners in oxc_coverage are a dumpster fire, and I've had the urge to refactor them from time to time ... but I decided not to, because it's not a place where we change code often.

So given that the code in ast_codegen is not going to change often, we can hack on it from time to time, but a big refactor may not be necessary.

In software engineering, we should spend more time crafting code that will change a lot, for higher impact.

@Boshen (Member) commented Jul 27, 2024

Closing as "issue too broad". Prefer incremental or more concrete changes.

Boshen closed this as not planned on Jul 27, 2024
overlookmotel pushed a commit that referenced this issue Aug 6, 2024
This PR introduces `EarlyCtx` and `LateCtx` in place of the old `CodegenCtx`. Early passes operate at the AST level, while generators and other late passes operate on the schema.

It also replaces the confusing `RType` name with something more idiomatic ~~(open for suggestions, I haven't found a good name yet)~~ I've named it `AstType` and dropped the `R` prefix from `REnum` and `RStruct`.

There are some quality-of-life improvements too; things like `to_type_elide` can be used to simplify the code.

Related to #4442 (and can potentially mark it "close as fixed").
@overlookmotel (Collaborator, Author) commented Aug 6, 2024

rzvxa and I came to an agreement in the end. The codegen has been refactored to be more schema-centric in #4682.

rzvxa closed this as completed on Aug 6, 2024