Improve memory usage and perf of CodePointCharStream: Use 8-bit, 16-bit, or 32-bit buffer #1781
Conversation
Cool. Does the new lexer timing stuff say anything interesting about the cost of the new streams? I should probably add a memory size check to those tests... Java makes it hard but maybe there's a new API I don't know about.
    @@ -3,17 +3,13 @@
      * Use of this file is governed by the BSD 3-clause license that
      * can be found in the LICENSE.txt file in the project root.
      */
    -package org.antlr.v4.test.runtime.java;
    +package org.antlr.v4.runtime;
Should the test code really be in the runtime instead of the test area?
Good spot! I did this on purpose so I could keep the API CodePointBuffer.getType() package-private. (That's a common pattern: put the test in the same Java package so the tests, but not the public, can access certain APIs.)
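As an illustration of that pattern, here is a minimal sketch (simplified stand-in classes, not the real CodePointBuffer internals): the production class and the test share a package name but live under different source roots, so the test can call the package-private method without widening its visibility.

    // main source root: org/antlr/v4/runtime/CodePointBuffer.java
    package org.antlr.v4.runtime;

    public class CodePointBuffer {
        enum Type { BYTE, CHAR, INT }           // simplified stand-in

        private final Type type;

        CodePointBuffer(Type type) { this.type = type; }

        Type getType() {                        // package-private: visible to same-package tests,
            return type;                        // invisible to code outside org.antlr.v4.runtime
        }
    }

    // test source root: org/antlr/v4/runtime/TestCodePointBuffer.java
    package org.antlr.v4.runtime;               // same package name, different source tree

    import org.junit.Assert;
    import org.junit.Test;

    public class TestCodePointBuffer {
        @Test
        public void typeIsAccessibleFromTheSamePackage() {
            CodePointBuffer buffer = new CodePointBuffer(CodePointBuffer.Type.BYTE);
            Assert.assertEquals(CodePointBuffer.Type.BYTE, buffer.getType());
        }
    }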
ok, we should definitely keep it in the testing area so it does not get pushed out as part of the runtime jar. Can we separate it somehow?
Ah, I see the confusion. I didn't move which Maven target it was in, so it won't show up in the runtime jar. I just put it in the same Java package so it could access package-private methods.
Ah. ok. that will be a bit weird in dir tree layout but no biggie
I'm trying to run the new test, but it looks like it passes a URL where we expect a Path. I'll try to fix it.
ok, thanks. yeah, I've been running from the development environment not from the jar. I should've known there was going to be an issue finding the files to test...
Fixed the test: #1787. Ran the test on Linux with an Intel Xeon E5-2600 @ 2.20 GHz. Before:
After:
Now that I have the perf test running cleanly in my environment, I'll attach YourKit and see what I can fix.
OK, with #1788 on top of this:
I know how to improve the perf a few ways as well.
Hmm...load times seem slow (you got an SSD and fast clock)? Also, the sizes are bigger, not smaller, for Parser.java.
Yep, I profiled the load times and I see the bottleneck. Java's native UTF-8 decoder takes advantage of the raw array interface of the underlying buffers; I'll do the same. How did you come to the conclusion that the sizes are bigger and not smaller for Parser.java?
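For context, a hedged sketch of that approach (my own illustration, not the PR's actual code): read the raw bytes up front, then let Java's built-in UTF-8 decoder do the work against array-backed buffers.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class DecodeSketch {
        public static void main(String[] args) throws IOException {
            byte[] utf8 = Files.readAllBytes(Paths.get(args[0]));
            CharBuffer chars = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE)
                    .decode(ByteBuffer.wrap(utf8));     // array-backed on both sides
            System.out.println("decoded " + chars.remaining() + " UTF-16 units");
        }
    }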
My run shows:

    load_legacy_java_utf8 average time 61us size 64788b over 3500 loads of 29038 symbols from Parser.java
    load_legacy_java_utf8 average time 122us size 102088b over 3500 loads of 13379 symbols from udhr_hin.txt
    load_new_utf8 average time 212us size 176187b over 3500 loads of 29038 symbols from Parser.java
    load_new_utf8 average time 206us size 49439b over 3500 loads of 13379 symbols from udhr_hin.txt

but you are showing (132,788 > 64,788 for Parser.java, 128,671 > 102,088 for udhr_hin.txt, though your utf8 is smaller for Parser.java):

    load_legacy_java_utf8 average time 272us size 132788b over 3500 loads of 29038 symbols from Parser.java
    load_legacy_java_utf8 average time 305us size 128671b over 3500 loads of 13379 symbols from udhr_hin.txt
    load_new_utf8 average time 745us size 116765b over 3500 loads of 29038 symbols from Parser.java
    load_new_utf8 average time 539us size 119317b over 3500 loads of 13379 symbols from udhr_hin.txt

I was hoping to see Parser.java be 50% of the 64,788 size for the legacy load.
I think your run might have been missing the actual memory needed to load the resource from the JAR. Mine includes it as part of #1788. I profiled the memory usage and can get it down another 20% or so. :)

Ben
ah. damn, Java makes it hard to profile this stuff. Hmm...ok, I'll take a look later.
Actually, shouldn't this avoid ClassLoader-related memory requirements?

    public void load_legacy_java_ascii(int n) throws Exception {
        URL sampleJavaFile = TimeLexerSpeed.class.getClassLoader().getResource(Parser_java_file);
        long start = System.nanoTime();
        CharStream[] input = new CharStream[n]; // keep refs around so we can average memory
        System.gc();
        long beforeFreeMem = Runtime.getRuntime().freeMemory();
Oh, never mind, I see the diff in our code. Yeah, we should try to isolate the buffer size if we can.
With a bit o' wrangling:
OK, I ended up removing my hand-written UTF-8 decoding code. Memory usage is still better than before, but I'll profile more to see what I can prune. BTW, #1788 is included inside this PR now. I'll rebase whenever #1788 lands.
YourKit profiling is a wonderful thing! I found the perf hotspot in CodePointCharStream was the virtual method calls from CharStream.LA(offset) into IntBuffer. Refactoring it into CodePointBuffer didn't help (in fact, it added another virtual method call). To fix the perf, I made CodePointCharStream an abstract class and made three concrete subclasses (CodePoint8BitCharStream, CodePoint16BitCharStream, and CodePoint32BitCharStream) which directly access the array of underlying code points without virtual method calls. Et voilà, perf is in line with or slightly better than the original.
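Roughly the shape of that change, as a hedged sketch with invented names (the real subclasses carry more state and bounds handling):

    // Sketch only: an abstract base holds the cursor, and each concrete subclass reads
    // straight out of a primitive array in LA(), so the hot path has no virtual call
    // into a separate buffer object.
    abstract class CodePointStreamSketch {
        protected final int size;
        protected int position = 0;

        protected CodePointStreamSketch(int size) { this.size = size; }

        public abstract int LA(int i);

        public void consume() { position++; }
    }

    final class EightBitStreamSketch extends CodePointStreamSketch {
        private final byte[] data;                       // one byte per code point <= U+00FF

        EightBitStreamSketch(byte[] data) { super(data.length); this.data = data; }

        @Override
        public int LA(int i) {
            int offset = position + i - 1;
            if (offset < 0 || offset >= size) {
                return -1;                               // EOF sentinel (IntStream.EOF in ANTLR)
            }
            return data[offset] & 0xFF;                  // zero-extend the byte to a code point
        }
    }

    // A 16-bit variant would hold a char[] and a 32-bit variant an int[], with the same
    // shape of LA() implementation.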
(I sure wish Java had proper generics on primitive types... This whole type erasure thing was definitely not the right call!)
Ben, unrelated to this but you may want to look at #1786 and comment.

Eric
Found a nice ~50% speed boost in the lexer. Sent out a separate PR for that: #1789. Latest numbers:
My phat box gets these numbers now (with IntervalSet binary search update and with an adjustment to memory size calculation but NOT these new stream changes):
Weird. Here are the numbers after bringing these changes into a test branch. The legacy stuff is noticeably slower:
I don't see how that is possible, given the changes in this PR don't affect the legacy streams or lexer in any way I can see. @bhamiltoncx, any ideas? Note that some of the new numbers are much better due to the new streams stuff. And (woot!) the size of streams is smaller for ASCII Java code using the new streams.
wat.gif
…On Tue, Mar 28, 2017, 6:14 PM Terence Parr wrote:

Warming up with new streams, testing legacy, I can knock it from 455ms to 355ms by creating a new JavaLexer each iteration in tokenize():

    CharStream input = lexer.getInputStream(); // <-- add this
    for (int i = 0; i < n; i++) {
        lexer = new JavaLexer(input); // <-- add this
        lexer.reset();
Hmm...ok, seems to bounce between 350 and 450ms with or w/o the new JavaLexer. This is with 30,000 runs, dropping the highest 10% CPU time runs. Legacy warmup is rock solid at 285ms testing repeatedly with 30,000 runs. Something is causing this but I'll be damned if I know!
So, I've determined it's not the …
FYI, I'm seeing the behavior where the legacy lexer slows down 5-10% when we warm up the new lexer on master, so I don't think it's related to this PR.
I saw the same thing yesterday. I believe I have a good hypothesis. If it ain't the data that's changing, it must be the code. If the code ain't changing, it must be the JIT optimization. We got a hint about this yesterday when consume() all of a sudden popped up as significant. I believe it is the dynamic dispatch through an interface (vs a class hierarchy) that prevents optimization. I have an experiment to try as soon as I get in, but this theory smacks of correctness.
Debugging == the scientific method of computing. :)
Ahhh. I wouldn't be at all surprised if the JIT was optimized for a single implementation of an interface. That would avoid the vtable dispatch in consume() if we don't warm up the new lexer, but as soon as we warm up the new lexer, we have two implementations of CharStream.
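To make the idea concrete, here is a standalone toy (my own illustration, nothing to do with ANTLR's actual classes) showing the call-site shape: the hot loop's interface call stays monomorphic, and therefore inlinable, until a second implementation gets warmed up.

    interface StreamLike {
        int LA(int i);
    }

    final class LegacyImpl implements StreamLike {
        private final char[] data;
        LegacyImpl(char[] data) { this.data = data; }
        public int LA(int i) { return i <= data.length ? data[i - 1] : -1; }
    }

    final class NewImpl implements StreamLike {
        private final int[] data;
        NewImpl(int[] data) { this.data = data; }
        public int LA(int i) { return i <= data.length ? data[i - 1] : -1; }
    }

    public class DispatchDemo {
        // Hot loop: while only LegacyImpl has ever reached this call site, HotSpot can
        // devirtualize and inline LA(); once NewImpl is also observed here, the site
        // becomes polymorphic and goes back to dispatching through the interface.
        static long sum(StreamLike s, int n) {
            long total = 0;
            for (int i = 1; i <= n; i++) {
                total += s.LA(i);
            }
            return total;
        }

        public static void main(String[] args) {
            char[] a = new char[10_000];
            int[] b = new int[10_000];
            for (int i = 0; i < 1_000; i++) sum(new LegacyImpl(a), a.length);   // monomorphic phase
            for (int i = 0; i < 1_000; i++) sum(new NewImpl(b), b.length);      // "warm up the new lexer"
            for (int i = 0; i < 1_000; i++) sum(new LegacyImpl(a), a.length);   // legacy path, now slower
        }
    }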
Yup, that's it. Check out the last two bullets under "Methods": https://wiki.openjdk.java.net/display/HotSpot/PerformanceTechniques
That explains why the JMH benchmark isn't impacted: it only ever sees one implementation of CharStream.
So, tl;dr: don't mix the legacy and new streams. I think this is strong evidence that we should just mark ANTLRInputStream deprecated.
Victory! Yep, we mark as deprecated and use the new interface. Let me scour the interface today so we can move forward with the 4.7 release :) I was going to bust out https://github.com/AdoptOpenJDK/jitwatch but maybe I can "prove" this with some code duplication... maybe make a dummy common root to ANTLRInputStream and CodePointStream?
Confirmed. If you run just one test, lex_new_java_utf8, without warmup, it's slightly faster than the old stream. If you add legacy warmup, the new stream test is slow again. I think that pretty much proves it. Using both causes megamorphic calls, preventing inlining and such too. I should write up this debugging path for the students!
Yep, here's conclusive proof: I made a new pass-through second implementation of CharStream. No new code needed. Jeez, the JIT is complicated. One more argument in favor of JMH...
Well, sorta ;) A microbenchmark in isolation totally missed this issue, as it's not a realistic environment for testing real-world usage. Someone could erroneously report benchmarks showing ANTLR 25% slower. Sam and I clawed our way to the current speed, so it was potentially a problem. @sharwell was the one that asked me to run timing tests on the new stuff to verify we didn't erode our win. A wider test such as the one I hand-rolled caught something that forces my hand on deprecating the old shit in favor of your awesome new streams!
True that! I think it's also an argument in favor of aggressively making classes final.
Re the new interface, what happens when someone sends in charset UTF-16 or UTF-32?
Functionality will work fine, I'll add a test. (Or do you mean perf-wise?)
Just responding to the comment "For other sources, only supports Unicode code points up to U+FFFF." :)
I'm pulling this in so we can tweak comments etc. and move forward. Could you freshen the interfaces in the other languages to be as consistent as possible? Thanks!
Oops, comment is definitely out-of-date! Thanks, will follow up with a PR to add a test and fix the comments.
Do a victory lap, Ben!!!
BTW, from a Java expert at FB, we might want to look at JMH's BULK warmup mode to simulate scenarios like this:
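As a minimal sketch (my own illustration, not a quote from that suggestion), turning on bulk warmup with JMH's options API looks roughly like this; WarmupMode.BULK warms up every benchmark in the run before measuring each one, so another benchmark's extra interface implementations are already loaded when a given benchmark is timed:

    import org.openjdk.jmh.runner.Runner;
    import org.openjdk.jmh.runner.RunnerException;
    import org.openjdk.jmh.runner.WarmupMode;
    import org.openjdk.jmh.runner.options.Options;
    import org.openjdk.jmh.runner.options.OptionsBuilder;

    public class BulkWarmupMain {
        public static void main(String[] args) throws RunnerException {
            Options opts = new OptionsBuilder()
                    .include(".*CharStreamBench.*")     // hypothetical benchmark class pattern
                    .warmupMode(WarmupMode.BULK)        // warm up all benchmarks, then measure each
                    .build();
            new Runner(opts).run();
        }
    }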
@bhamiltoncx What was the result of using ANTLRUnicodeInputStream (sharwell@51dbd231) instead of a whole new stream? One optimization that Java seems to like that I didn't do in this file so far is splitting LA(int) into a short fast path that can be inlined, with the longer slow path after the initial condition moved to a separate method. This stream is optimized for code points no greater than U+FFFF, but the lexer itself is optimized for code points no greater than U+007F, so that doesn't seem like a particular problem or mismatch.
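For reference, the fast-path/slow-path split described here might look something like the following (hypothetical names, not the shipped stream): keep the common case tiny so HotSpot inlines it, and push the rare cases into a separate, colder method.

    import org.antlr.v4.runtime.IntStream;

    final class FastPathStreamSketch {
        private final int[] data;    // code points decoded up front
        private int position;

        FastPathStreamSketch(int[] codePoints) { this.data = codePoints; }

        public int LA(int i) {
            int offset = position + i - 1;
            if (i > 0 && offset < data.length) {
                return data[offset];             // fast path: one bounds check, one array read
            }
            return LASlowPath(i, offset);        // everything unusual lives elsewhere
        }

        private int LASlowPath(int i, int offset) {
            if (i == 0) {
                throw new UnsupportedOperationException("LA(0) is undefined");
            }
            if (i < 0) {
                int back = position + i;         // negative lookahead into already-consumed input
                return back >= 0 ? data[back] : IntStream.EOF;
            }
            return IntStream.EOF;                // offset past the end of the buffer
        }

        public void consume() { position++; }
    }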
@sharwell: Since we read the entire file upfront anyway, I felt it made sense to avoid any runtime logic altogether and decode to code points when we read the file into memory. That way, LA(1) becomes a simple array access with no conditionals. The profiling results show the end result uses less memory (byte array for Latin input) and the same amount of CPU.
Here is an unbelievably amazing write-up of exactly the processes behind the JIT optimizer that explains what we ran into here. Great reading!
This greatly improves the memory usage and performance of CodePointCharStream by ensuring the internal storage uses either an 8-bit buffer (for Unicode code points <= U+00FF), a 16-bit buffer (for Unicode code points <= U+FFFF), or a 32-bit buffer (for Unicode code points > U+FFFF).

I split out the internal storage into a class CodePointBuffer, which has a CodePointBuffer.Builder class containing the logic to upgrade from 8-bit to 16-bit to 32-bit storage.

I found the perf hotspot in CodePointCharStream on master was the virtual method calls from CharStream.LA(offset) into IntBuffer. Refactoring it into CodePointBuffer didn't help (in fact, it added another virtual method call).

To fix the perf, I made CodePointCharStream an abstract class and made three concrete subclasses, CodePoint8BitCharStream, CodePoint16BitCharStream, and CodePoint32BitCharStream, which directly access the array of underlying code points in the CodePointBuffer without virtual method calls.
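The upgrade logic mentioned above could be sketched roughly like this (a simplified, hypothetical illustration rather than the real CodePointBuffer.Builder): append each code point into the narrowest array that still fits everything seen so far, copying up to a wider array the first time a larger code point arrives.

    final class UpgradingCodePointBuilder {
        private byte[] byteBuffer = new byte[1024];
        private char[] charBuffer;
        private int[] intBuffer;
        private int length;

        void append(int codePoint) {
            ensureCapacity();
            if (intBuffer != null) {
                intBuffer[length++] = codePoint;                 // already 32-bit storage
            } else if (charBuffer != null) {
                if (codePoint > 0xFFFF) {
                    upgradeToInt();
                    intBuffer[length++] = codePoint;
                } else {
                    charBuffer[length++] = (char) codePoint;     // 16-bit storage
                }
            } else if (codePoint > 0xFF) {
                upgradeToChar();
                append(codePoint);                               // retry at 16-bit (or wider)
            } else {
                byteBuffer[length++] = (byte) codePoint;         // 8-bit storage
            }
        }

        private void upgradeToChar() {
            charBuffer = new char[byteBuffer.length];
            for (int i = 0; i < length; i++) {
                charBuffer[i] = (char) (byteBuffer[i] & 0xFF);   // bytes hold unsigned values
            }
            byteBuffer = null;
        }

        private void upgradeToInt() {
            intBuffer = new int[charBuffer.length];
            for (int i = 0; i < length; i++) {
                intBuffer[i] = charBuffer[i];
            }
            charBuffer = null;
        }

        private void ensureCapacity() {
            if (byteBuffer != null && length == byteBuffer.length) {
                byteBuffer = java.util.Arrays.copyOf(byteBuffer, length * 2);
            } else if (charBuffer != null && length == charBuffer.length) {
                charBuffer = java.util.Arrays.copyOf(charBuffer, length * 2);
            } else if (intBuffer != null && length == intBuffer.length) {
                intBuffer = java.util.Arrays.copyOf(intBuffer, length * 2);
            }
        }
    }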