Thoughts on including optimization passes? #11
Comments
That's remarkable! It is contrary to my own testing, which indicated that running `opt` made no meaningful difference. There are two concerns that I have re: including more stuff in `amaranth-yosys`: (1) the impact on warm startup time, and (2) the build and maintenance burden.

None of these are insurmountable, and I have reasonably clear criteria for what's acceptable regarding these concerns and what isn't.

For (1), if you can get some numbers on how much impact it has on warm (cached) startup time on "a typical x86_64 target" (as opposed to e.g. "Apple M2", which isn't very representative), we can go from there.

For (2), there are two aspects to it. The first one is simply the amount of Stuff that will have to go into the Makefile. The second one is the cost of updating it; I don't actually know how bad that cost is. I am open to consideration here if you're willing to commit to maintaining this, on an indefinite basis but with the option to stop doing so at any point (at which point I will only support it on a best-effort basis and might just remove it). (I find the Yosys build system quite painful to work with.)
To add to the above, the highlight of the 0.6 milestone is (as is stated in the description) CXXRTL integration, so any additional speedup, much less a 1.5x speedup (which will completely hide the ~10-15% overhead of capturing a full view using the request/replay machinery), is very relevant to the milestone.
Thinking about this some more, I have a question, or a suspicion, regarding the speedup, where 1.5x is a fairly round number: does that come from less time spent per delta cycle, or fewer delta cycles total?
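As a toy sketch of the distinction (plain Python, not CXXRTL; the node names and functions are made up): a "delta cycle" here is one evaluate-and-compare round of the combinational network, repeated until nothing changes. An optimized netlist can win either by making each round cheaper or by needing fewer rounds to settle.

```python
# Toy fixpoint settler: repeatedly evaluate every node until no value
# changes, counting how many evaluation rounds ("delta cycles") it takes.

def settle(nodes, funcs):
    """Settle a combinational network; return (final values, delta cycles).

    `funcs` maps a node name to a function of the current value dict.
    """
    values = dict(nodes)
    deltas = 0
    while True:
        new = {n: f(values) for n, f in funcs.items()}
        deltas += 1
        if all(new[n] == values[n] for n in new):
            break  # fixpoint reached
        values.update(new)
    return values, deltas

# A two-stage chain: c depends on b, b depends on a. Settling from an
# unknown state takes several rounds; a flattened netlist where c reads
# a directly would settle in fewer.
funcs = {"b": lambda v: v["a"] + 1, "c": lambda v: v["b"] * 2}
values, deltas = settle({"a": 1, "b": 0, "c": 0}, funcs)
print(deltas, values["c"])  # → 3 4
```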
It is a rather round number! For the record, I was having the design run at ~12MHz after warmup without `opt`. I'll look into which of the two it is (time per delta cycle vs. fewer delta cycles) and do the timing analysis per (1) early next week, most likely. As for (2): due to all that Nix stuff I did last year ("hdx"), I'm un(?)fortunately(??) familiar with the nightmare that is Yosys' build system.
Yesssss, okay, this is really good to know! Thanks and noted!
Sounds good.
I compared the return values of both runs: they match, and the delta cycle counts are the same, so the speedup comes from less time spent per delta cycle. I note the generated IL is reduced quite a bit (168,099 bytes after `opt`). I also tried a different codebase and there was very little speedup, so it's very much design-dependent!
Thanks. All of this makes intuitive sense to me, and I've also been expecting it to be design-dependent. Now that you've confirmed that the difference comes specifically from the netlist optimization, I have no further concerns about the applicability of this change. (If one of the netlists had more delta cycles, I'd have asked you to look into that first.)
The filesize grows a bit, too. Here are our startup time measurements.

Testing on a CPX21 Hetzner cloud instance:

Status quo:

With `opt`:

Testing on a NAS with an AMD R1600 "embedded" CPU:

Status quo:

With `opt`:
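The measurement harness itself isn't shown above; one plausible way to take warm-startup numbers like these is below (a sketch, not the actual method used; `sys.executable -c pass` is a stand-in for the real packaged-yosys invocation):

```python
import statistics
import subprocess
import sys
import time

def warm_startup_ms(argv, runs=5):
    """Median wall-clock time of a short-lived command, in milliseconds.

    One untimed run first, so any caches (e.g. a compiled-wasm cache)
    are warm before measurement starts.
    """
    subprocess.run(argv, check=True, capture_output=True)  # warm-up run
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        subprocess.run(argv, check=True, capture_output=True)
        samples.append((time.perf_counter() - t0) * 1000)
    return statistics.median(samples)

# Stand-in command: a real comparison would time the packaged yosys
# running a trivial script, with and without the extra passes built in.
ms = warm_startup_ms([sys.executable, "-c", "pass"])
```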
As for (2), I'd be happy to keep maintaining this. I must admit, I was pretty indiscriminate with it: I just added everything in `passes/opt`.
It's compressed in the wheel, for what it's worth.
This is all completely fine. It should not be noticeable in any meaningful way; 30 ms on an embedded CPU is below the perceptual threshold, I think.
Yeah, that's fine, I think.
From 5,290,853 bytes to 7,708,881 bytes (+46%).
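Sanity-checking the +46% figure from the numbers quoted:

```python
# Build size before and after adding the extra passes, as quoted above.
before, after = 5_290_853, 7_708_881
growth_pct = (after - before) / before * 100
print(round(growth_pct))  # → 46
```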
Right, so neither the startup latency nor the download size is going to be noticeable. I think it's also possible that wasmtime got better in the meantime; back when I first made this package, the difference between running yowasp-yosys and amaranth-yosys was quite drastic. Perhaps at some point it would make sense to deprecate amaranth-yosys entirely, if yowasp-yosys becomes just as fast; ultimately I don't want to maintain two separate, subtly different and incompatible (and differently versioned) builds of the same software.
From memory, the startup latency was something like 150-300 ms for this package and 3-4 s for yowasp-yosys. That's a little too noticeable for seemingly simple operations.
Hm. On the Hetzner VPS:

That's 25 ms faster than current `amaranth-yosys`.

On my NAS:

27 ms faster. I've sanity-checked that the comparison is fair.

Here's the thing that super confuses me: on my M3, here's current `amaranth-yosys`:

And here's the yowasp-yosys build:

191% slower. ?????? Something in wasmtime that really struggles on arm64 and only comes out in the full distribution? Or only with WASI SDK 22? It appears to mostly occur during exit: the pause comes after all the output has been printed.
Try building amaranth-yosys with wasi-sdk 22.0?
Will!
No difference there. Also confirmed the pause comes after the entirety of the output has been produced. I don't know enough about Python here yet to say why. I tried removing the separate thread (to bring YoWASP closer to how amaranth-yosys runs) and removing a bunch of the preopens, but no difference. If this particular behaviour (yowasp-yosys on darwin-arm64 hangs for ~300 ms on exit for whatever reason) is what makes the difference between dropping amaranth-yosys for it and not, I'm happy to figure it out (and probably learn how to use macOS's Instruments or whatever); otherwise I'll leave it.
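One way to separate "time until the last output line" from "time until the process actually exits" (a diagnostic sketch, not what was actually run; the stand-in `print('done')` command would be replaced by the real yowasp-yosys invocation):

```python
import subprocess
import sys
import time

def output_vs_exit_ms(argv):
    """Return (ms until last stdout line arrived, ms until process exit).

    A large gap between the two suggests the pause happens during
    teardown, after all useful work is done.
    """
    t0 = time.perf_counter()
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, text=True)
    last_output = t0
    for _line in proc.stdout:
        last_output = time.perf_counter()  # timestamp of the latest line
    proc.wait()
    t_exit = time.perf_counter()
    return (last_output - t0) * 1000, (t_exit - t0) * 1000

# Stand-in command producing a single line of output.
out_ms, exit_ms = output_vs_exit_ms([sys.executable, "-c", "print('done')"])
```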
That is so puzzling. Let's just ship the `opt` change, then.
Alright! I'll open a PR.
README states:

No qualms here, it's `amaranth-yosys`. But I've noticed that running `opt` on a design before `write_cxxrtl` can (at least) speed up a low-to-moderate complexity design (a barebones RV32 core) by 1.5x in simulation rate, compiling everything with `-O3` in both cases.

It's not a big deal since I can just use system Yosys, but I thought I'd bring it up and see what you thought. There's no way this would be exposed as-is (and so would be a waste except for people calling Amaranth's Yosys directly, like me), but e.g. maybe one day `amaranth.cli` would support enabling optimizations.

Noting here that plenty of `opt` passes don't run because we don't run `proc`, and running `proc; opt` is worse than doing nothing, presumably because CXXRTL's own proc pass does a much better job at this.

(This was unearthed when a CI build failed, since I deliberately didn't install Yosys for a "unit tests only" job.)