
Improve memory usage and perf of CodePointCharStream: Use 8-bit, 16-bit, or 32-bit buffer #1781

Merged
1 commit merged into antlr:master on Mar 29, 2017

Conversation

bhamiltoncx
Contributor

@bhamiltoncx bhamiltoncx commented Mar 22, 2017

This greatly improves the memory usage and performance of CodePointCharStream by ensuring the internal storage uses either an 8-bit buffer (for Unicode code points <= U+00FF), a 16-bit buffer (for Unicode code points <= U+FFFF), or a 32-bit buffer (for Unicode code points > U+FFFF).

I split the internal storage out into a class, CodePointBuffer, whose nested CodePointBuffer.Builder class contains the logic to upgrade from 8-bit to 16-bit to 32-bit storage.

I found that the perf hotspot in CodePointCharStream on master was the virtual method calls from CharStream.LA(offset) into IntBuffer.

Refactoring it into CodePointBuffer didn't help (in fact, it added another virtual method call).

To fix the perf, I made CodePointCharStream an abstract class and added three concrete subclasses (CodePoint8BitCharStream, CodePoint16BitCharStream, and CodePoint32BitCharStream) that directly access the underlying code point array in the CodePointBuffer without virtual method calls.
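
For reference, the storage-upgrade idea can be sketched like this (a minimal sketch with a fixed capacity and illustrative names, not the actual CodePointBuffer.Builder code):

    // Minimal sketch of the 8/16/32-bit upgrade idea; names and structure are illustrative.
    final class CodePointStorageSketch {
        private byte[] bytes;   // active while every code point fits in 8 bits
        private char[] chars;   // active once some code point needs 16 bits
        private int[]  ints;    // active once some code point needs more than 16 bits
        private int size;

        CodePointStorageSketch(int capacity) { bytes = new byte[capacity]; }

        void append(int codePoint) {
            if (ints == null && codePoint > 0xFFFF) upgradeToInts();
            else if (ints == null && chars == null && codePoint > 0xFF) upgradeToChars();

            if (ints != null)       ints[size++]  = codePoint;
            else if (chars != null) chars[size++] = (char) codePoint;
            else                    bytes[size++] = (byte) codePoint;
        }

        int codePointAt(int i) {
            if (ints != null)  return ints[i];
            if (chars != null) return chars[i];     // char values are already unsigned
            return bytes[i] & 0xFF;                 // mask so the byte reads as unsigned
        }

        private void upgradeToChars() {
            chars = new char[bytes.length];
            for (int i = 0; i < size; i++) chars[i] = (char) (bytes[i] & 0xFF);
            bytes = null;
        }

        private void upgradeToInts() {
            int capacity = (chars != null) ? chars.length : bytes.length;
            ints = new int[capacity];
            for (int i = 0; i < size; i++) {
                ints[i] = (chars != null) ? chars[i] : (bytes[i] & 0xFF);
            }
            chars = null;
            bytes = null;
        }
    }

The concrete 8-, 16-, and 32-bit CharStream subclasses then index the matching array directly, so the hot LA() path becomes an array read plus a mask instead of a virtual call.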

@parrt
Member

parrt commented Mar 23, 2017

Cool. Does the new lexer timing stuff say anything interesting about the cost of the new streams? I should probably add a memory size check to those tests... Java makes it hard but maybe there's a new API I don't know about.

@@ -3,17 +3,13 @@
* Use of this file is governed by the BSD 3-clause license that
* can be found in the LICENSE.txt file in the project root.
*/
-package org.antlr.v4.test.runtime.java;
+package org.antlr.v4.runtime;
Member

Should the test code really be in the runtime instead of the test area?

Contributor Author

Good spot! I did this on purpose so I could keep the API CodePointBuffer.getType() package-private. (That's a common pattern: it allows tests, but not the public, to access certain APIs.)
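
(For anyone unfamiliar with the pattern, here's a generic illustration with made-up names; the test class shares the production class's package but lives under the test source root, so it can reach package-private members without widening the public API:)

    // Hypothetical example of the package-private testing pattern.
    // Both classes are in the same Java package; the test lives under src/test/java.
    package com.example.buffer;

    public class WidthBuffer {
        // package-private: callable from same-package tests, invisible to library users
        int internalWidth() { return 8; }
    }

    class WidthBufferTest {
        void checkWidth() {
            if (new WidthBuffer().internalWidth() != 8) throw new AssertionError();
        }
    }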

Member

OK, we should definitely keep it in the test area so it does not get pushed out as part of the runtime jar. Can we separate it somehow?

Contributor Author

Ah, I see the confusion. I didn't move which Maven target it was in, so it won't show up in the runtime jar. I just put it in the same Java package so it could access package-private methods.

Member

Ah, OK. That will be a bit weird in the dir tree layout, but no biggie.

@bhamiltoncx
Contributor Author

Does the new lexer timing stuff say anything interesting about the cost of the new streams?

I'm trying to run the new test, but it looks like it passes a URL where we expect a Path:

% java -cp /data/users/beng/src/antlr4/runtime/Java/target/antlr4-runtime-4.7-SNAPSHOT.jar:/data/users/beng/src/antlr4/runtime-testsuite/target/antlr4-runtime-testsuite-4.7-SNAPSHOT.jar:/data/users/beng/src/antlr4/runtime-testsuite/target/antlr4-runtime-testsuite-4.7-SNAPSHOT-tests.jar org.antlr.v4.test.runtime.java.api.perf.TimeLexerSpeed      
Warming up Java compilerException in thread "main" java.nio.file.NoSuchFileException: file:/data/users/beng/src/antlr4/runtime-testsuite/target/antlr4-runtime-testsuite-4.7-SNAPSHOT.jar!/Java/src/org/antlr/v4/runtime/Parser.java
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
        at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
        at java.nio.file.Files.newByteChannel(Files.java:361)
        at java.nio.file.Files.newByteChannel(Files.java:407)
        at org.antlr.v4.runtime.CharStreams.fromPath(CharStreams.java:59)
        at org.antlr.v4.runtime.CharStreams.fromPath(CharStreams.java:43)
        at org.antlr.v4.test.runtime.java.api.perf.TimeLexerSpeed.lex_new_java_utf8(TimeLexerSpeed.java:208)
        at org.antlr.v4.test.runtime.java.api.perf.TimeLexerSpeed.compilerWarmUp(TimeLexerSpeed.java:110)
        at org.antlr.v4.test.runtime.java.api.perf.TimeLexerSpeed.main(TimeLexerSpeed.java:71)

I'll try to fix it.
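
For what it's worth, a sketch of the kind of workaround that avoids resolving a jar resource to a filesystem Path (not necessarily the exact fix in #1787) is to read the classpath resource as a stream:

    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;
    import org.antlr.v4.runtime.CharStream;
    import org.antlr.v4.runtime.CharStreams;

    public class LoadFromClasspath {
        // Paths.get(url.toURI()) fails when the resource lives inside a jar;
        // reading the resource as a stream works for both directories and jars.
        public static CharStream load(String resourceName) throws Exception {
            ClassLoader loader = LoadFromClasspath.class.getClassLoader();
            try (InputStream in = loader.getResourceAsStream(resourceName)) {
                if (in == null) throw new IllegalArgumentException("resource not found: " + resourceName);
                return CharStreams.fromStream(in, StandardCharsets.UTF_8);
            }
        }
    }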

@parrt
Member

parrt commented Mar 23, 2017

OK, thanks. Yeah, I've been running from the development environment, not from the jar. I should've known there would be an issue finding the files to test...

@bhamiltoncx
Contributor Author

Fixed the test: #1787

Ran the test on Linux with an Intel Xeon E5-2600 @ 2.20 GHz.

Before:

Warming up Java compiler....                                                                                                    
    load_legacy_java_utf8 average time   248us over 3500 loads of 29038 symbols from Parser.java                                
    load_legacy_java_utf8 average time   301us over 3500 loads of 13379 symbols from udhr_hin.txt                               
            load_new_utf8 average time   535us over 3500 loads of 29038 symbols from Parser.java                                
            load_new_utf8 average time   420us over 3500 loads of 13379 symbols from udhr_hin.txt                               
                                                                                                                                
     lex_legacy_java_utf8 average time   625us over 2000 runs of 29038 symbols                                                  
     lex_legacy_java_utf8 average time  1678us over 2000 runs of 29038 symbols DFA cleared                                      
        lex_new_java_utf8 average time   644us over 2000 runs of 29038 symbols                                                  
        lex_new_java_utf8 average time  1534us over 2000 runs of 29038 symbols DFA cleared                                      
                                                                                                                                
 lex_legacy_grapheme_utf8 average time 12950us over  400 runs of  6614 symbols from udhr_kor.txt                                
 lex_legacy_grapheme_utf8 average time 11953us over  400 runs of  6614 symbols from udhr_kor.txt DFA cleared                    
 lex_legacy_grapheme_utf8 average time 10144us over  400 runs of 13379 symbols from udhr_hin.txt                                
 lex_legacy_grapheme_utf8 average time 11146us over  400 runs of 13379 symbols from udhr_hin.txt DFA cleared                    
    lex_new_grapheme_utf8 average time 11914us over  400 runs of  6614 symbols from udhr_kor.txt                                
    lex_new_grapheme_utf8 average time 12030us over  400 runs of  6614 symbols from udhr_kor.txt DFA cleared                    
    lex_new_grapheme_utf8 average time 10400us over  400 runs of 13379 symbols from udhr_hin.txt                                
    lex_new_grapheme_utf8 average time 10365us over  400 runs of 13379 symbols from udhr_hin.txt DFA cleared                    
    lex_new_grapheme_utf8 average time   183us over  400 runs of    85 symbols from emoji.txt                                   
    lex_new_grapheme_utf8 average time   192us over  400 runs of    85 symbols from emoji.txt DFA cleared        

After:

Warming up Java compiler....
    load_legacy_java_utf8 average time   247us over 3500 loads of 29038 symbols from Parser.java
    load_legacy_java_utf8 average time   294us over 3500 loads of 13379 symbols from udhr_hin.txt
            load_new_utf8 average time   754us over 3500 loads of 29038 symbols from Parser.java
            load_new_utf8 average time   573us over 3500 loads of 13379 symbols from udhr_hin.txt

     lex_legacy_java_utf8 average time   661us over 2000 runs of 29038 symbols
     lex_legacy_java_utf8 average time  1753us over 2000 runs of 29038 symbols DFA cleared
        lex_new_java_utf8 average time  1313us over 2000 runs of 29038 symbols
        lex_new_java_utf8 average time  2217us over 2000 runs of 29038 symbols DFA cleared

 lex_legacy_grapheme_utf8 average time 11915us over  400 runs of  6614 symbols from udhr_kor.txt
 lex_legacy_grapheme_utf8 average time 12356us over  400 runs of  6614 symbols from udhr_kor.txt DFA cleared
 lex_legacy_grapheme_utf8 average time 10346us over  400 runs of 13379 symbols from udhr_hin.txt
 lex_legacy_grapheme_utf8 average time 10506us over  400 runs of 13379 symbols from udhr_hin.txt DFA cleared
    lex_new_grapheme_utf8 average time 12087us over  400 runs of  6614 symbols from udhr_kor.txt
    lex_new_grapheme_utf8 average time 12174us over  400 runs of  6614 symbols from udhr_kor.txt DFA cleared
    lex_new_grapheme_utf8 average time 10762us over  400 runs of 13379 symbols from udhr_hin.txt
    lex_new_grapheme_utf8 average time 10873us over  400 runs of 13379 symbols from udhr_hin.txt DFA cleared
    lex_new_grapheme_utf8 average time   194us over  400 runs of    85 symbols from emoji.txt
    lex_new_grapheme_utf8 average time   204us over  400 runs of    85 symbols from emoji.txt DFA cleared

Now that I have the perf test running cleanly in my environment, I'll attach YourKit and see what I can fix.

@bhamiltoncx
Contributor Author

OK, with #1788 on top of this:

    load_legacy_java_utf8 average time   272us size 132788b over 3500 loads of 29038 symbols from Parser.java
    load_legacy_java_utf8 average time   305us size 128671b over 3500 loads of 13379 symbols from udhr_hin.txt
            load_new_utf8 average time   745us size 116765b over 3500 loads of 29038 symbols from Parser.java
            load_new_utf8 average time   539us size 119317b over 3500 loads of 13379 symbols from udhr_hin.txt

I know how to improve the perf a few ways as well.

@parrt
Member

parrt commented Mar 23, 2017

Hmm... load times seem slow (you've got an SSD and a fast clock?). Also, the sizes are bigger, not smaller, for Parser.java.

@bhamiltoncx
Contributor Author

bhamiltoncx commented Mar 23, 2017

Yep, I profiled the load times and I see the bottleneck.

Java's native UTF-8 decoder takes advantage of the raw array interface of ByteBuffer to avoid lots of virtual function calls to access the bytes inside:

http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/sun/nio/cs/UTF_8.java#UTF_8.Decoder.decodeArrayLoop%28java.nio.ByteBuffer%2Cjava.nio.CharBuffer%29

I'll do the same.
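
The general shape of that array fast path, as a hedged sketch of the technique (not the JDK's actual decoder):

    import java.nio.ByteBuffer;

    final class AsciiFastPathSketch {
        // Copies the ASCII prefix of a buffer by indexing its backing array directly,
        // avoiding a virtual ByteBuffer.get() call per byte. Illustrative only.
        static int copyAsciiPrefix(ByteBuffer src, char[] dst) {
            int n = 0;
            if (src.hasArray()) {
                byte[] arr = src.array();
                int pos = src.arrayOffset() + src.position();
                int limit = src.arrayOffset() + src.limit();
                while (pos < limit && n < dst.length && arr[pos] >= 0) {
                    dst[n++] = (char) arr[pos++];
                }
                src.position(pos - src.arrayOffset());   // record how far we consumed
            }
            return n;
        }
    }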

How did you come to the conclusion that the sizes are bigger and not smaller for Parser.java? Did I misread the output?

@parrt
Member

parrt commented Mar 23, 2017

My run shows:

     load_legacy_java_utf8 average time    61us size  64788b over 3500 loads of 29038 symbols from Parser.java
     load_legacy_java_utf8 average time   122us size 102088b over 3500 loads of 13379 symbols from udhr_hin.txt
             load_new_utf8 average time   212us size 176187b over 3500 loads of 29038 symbols from Parser.java
             load_new_utf8 average time   206us size  49439b over 3500 loads of 13379 symbols from udhr_hin.txt

but you are showing bigger sizes (132,788 > 64,788 for Parser.java, 128,671 > 102,088 for udhr_hin.txt), though your new UTF-8 size is smaller than your legacy size for Parser.java:

    load_legacy_java_utf8 average time   272us size 132788b over 3500 loads of 29038 symbols from Parser.java
    load_legacy_java_utf8 average time   305us size 128671b over 3500 loads of 13379 symbols from udhr_hin.txt
            load_new_utf8 average time   745us size 116765b over 3500 loads of 29038 symbols from Parser.java
            load_new_utf8 average time   539us size 119317b over 3500 loads of 13379 symbols from udhr_hin.txt

I was hoping to see Parser.java be 50% size of 64,788 for legacy load.

@bhamiltoncx
Contributor Author

bhamiltoncx commented Mar 23, 2017 via email

@parrt
Member

parrt commented Mar 23, 2017

Ah. Damn, Java makes it hard to profile this stuff. Hmm... OK, I'll take a look later.

@parrt
Member

parrt commented Mar 23, 2017

Actually, shouldn't this avoid ClassLoader related memory requirements?

	public void load_legacy_java_ascii(int n) throws Exception {
		URL sampleJavaFile = TimeLexerSpeed.class.getClassLoader().getResource(Parser_java_file);
		long start = System.nanoTime();
		CharStream[] input = new CharStream[n]; // keep refs around so we can average memory
		System.gc();
		long beforeFreeMem = Runtime.getRuntime().freeMemory();

@parrt
Member

parrt commented Mar 23, 2017

Oh, never mind, I see the diff in our code. Yeah, we should try to isolate the buffer size if we can.

@bhamiltoncx
Contributor Author

With a bit o' wrangling:

Warming up Java compiler....
    load_legacy_java_utf8 average time   270us size 132216b over 3500 loads of 29038 symbols from Parser.java
    load_legacy_java_utf8 average time   290us size 127886b over 3500 loads of 13379 symbols from udhr_hin.txt
            load_new_utf8 average time   253us size 114134b over 3500 loads of 29038 symbols from Parser.java
            load_new_utf8 average time   310us size 112967b over 3500 loads of 13379 symbols from udhr_hin.txt

     lex_legacy_java_utf8 average time   804us over 2000 runs of 29038 symbols
     lex_legacy_java_utf8 average time  1804us over 2000 runs of 29038 symbols DFA cleared
        lex_new_java_utf8 average time  1308us over 2000 runs of 29038 symbols
        lex_new_java_utf8 average time  2418us over 2000 runs of 29038 symbols DFA cleared

 lex_legacy_grapheme_utf8 average time 11883us over  400 runs of  6614 symbols from udhr_kor.txt
 lex_legacy_grapheme_utf8 average time 12017us over  400 runs of  6614 symbols from udhr_kor.txt DFA cleared
 lex_legacy_grapheme_utf8 average time 10456us over  400 runs of 13379 symbols from udhr_hin.txt
 lex_legacy_grapheme_utf8 average time 10347us over  400 runs of 13379 symbols from udhr_hin.txt DFA cleared
    lex_new_grapheme_utf8 average time 12161us over  400 runs of  6614 symbols from udhr_kor.txt
    lex_new_grapheme_utf8 average time 12350us over  400 runs of  6614 symbols from udhr_kor.txt DFA cleared
    lex_new_grapheme_utf8 average time 11154us over  400 runs of 13379 symbols from udhr_hin.txt
    lex_new_grapheme_utf8 average time 11117us over  400 runs of 13379 symbols from udhr_hin.txt DFA cleared
    lex_new_grapheme_utf8 average time   213us over  400 runs of    85 symbols from emoji.txt
    lex_new_grapheme_utf8 average time   202us over  400 runs of    85 symbols from emoji.txt DFA cleared

@bhamiltoncx
Contributor Author

bhamiltoncx commented Mar 23, 2017

OK, I ended up removing my hand-written UTF8CodePointDecoder completely and just relying on the built-in JDK CharsetDecoder to convert from arbitrary encodings to UTF-16, then converting UTF-16 to Unicode code points.
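
In outline, that two-step decode looks something like this (a sketch with hypothetical names; the real code works buffer-to-buffer rather than through a String):

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.CodingErrorAction;

    final class DecodeSketch {
        // Step 1: arbitrary charset -> UTF-16 via the JDK's CharsetDecoder.
        // Step 2: UTF-16 -> Unicode code points.
        static int[] toCodePoints(byte[] bytes, Charset charset) throws Exception {
            CharBuffer utf16 = charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE)
                    .decode(ByteBuffer.wrap(bytes));
            return utf16.toString().codePoints().toArray();
        }
    }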

Memory usage is still better than before, but I'll profile more to see what I can prune.

BTW, #1788 is included inside this PR now. I'll rebase whenever #1788 lands.

@bhamiltoncx
Contributor Author

YourKit profiling is a wonderful thing!

I found that the perf hotspot in CodePointCharStream on master was the virtual method calls from CharStream.LA(offset) into IntBuffer.

Refactoring it into CodePointBuffer didn't help (in fact, it added another virtual method call).

To fix the perf, I made CodePointCharStream an abstract class and added three concrete subclasses (CodePoint8BitCharStream, CodePoint16BitCharStream, and CodePoint32BitCharStream) that directly access the underlying code point array in the CodePointBuffer without virtual method calls.

Et voilà, perf is in line with or slightly better than the original ANTLRInputStream and memory usage is better as well:

Warming up Java compiler....
    load_legacy_java_utf8 average time   275us size 132669b over 3500 loads of 29038 symbols from Parser.java
    load_legacy_java_utf8 average time   292us size 128806b over 3500 loads of 13379 symbols from udhr_hin.txt
            load_new_utf8 average time   266us size 114950b over 3500 loads of 29038 symbols from Parser.java
            load_new_utf8 average time   311us size 113672b over 3500 loads of 13379 symbols from udhr_hin.txt

     lex_legacy_java_utf8 average time  1065us over 2000 runs of 29038 symbols
     lex_legacy_java_utf8 average time  1971us over 2000 runs of 29038 symbols DFA cleared
        lex_new_java_utf8 average time  1026us over 2000 runs of 29038 symbols
        lex_new_java_utf8 average time  1960us over 2000 runs of 29038 symbols DFA cleared

 lex_legacy_grapheme_utf8 average time 12545us over  400 runs of  6614 symbols from udhr_kor.txt
 lex_legacy_grapheme_utf8 average time 12611us over  400 runs of  6614 symbols from udhr_kor.txt DFA cleared
 lex_legacy_grapheme_utf8 average time 11303us over  400 runs of 13379 symbols from udhr_hin.txt
 lex_legacy_grapheme_utf8 average time 11195us over  400 runs of 13379 symbols from udhr_hin.txt DFA cleared
    lex_new_grapheme_utf8 average time 12532us over  400 runs of  6614 symbols from udhr_kor.txt
    lex_new_grapheme_utf8 average time 12387us over  400 runs of  6614 symbols from udhr_kor.txt DFA cleared
    lex_new_grapheme_utf8 average time 10856us over  400 runs of 13379 symbols from udhr_hin.txt
    lex_new_grapheme_utf8 average time 11043us over  400 runs of 13379 symbols from udhr_hin.txt DFA cleared
    lex_new_grapheme_utf8 average time   187us over  400 runs of    85 symbols from emoji.txt
    lex_new_grapheme_utf8 average time   197us over  400 runs of    85 symbols from emoji.txt DFA cleared

@bhamiltoncx
Contributor Author

(I sure wish Java had proper generics over primitive types... This whole type erasure thing was definitely not the right call!)

@bhamiltoncx bhamiltoncx changed the title Improve memory usage of CodePointCharStream: Use 8-bit, 16-bit, or 32-bit buffer Improve memory usage and perf of CodePointCharStream: Use 8-bit, 16-bit, or 32-bit buffer Mar 24, 2017
@ericvergnaud
Contributor

ericvergnaud commented Mar 24, 2017 via email

@bhamiltoncx
Contributor Author

bhamiltoncx commented Mar 24, 2017

Found a nice ~50% speed boost in the lexer by making IntervalSet.contains(el) use binary search instead of a linear scan.

Sent out a separate PR for that: #1789
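
The idea behind that change, sketched on a sorted array of disjoint, inclusive intervals (illustrative, not the exact IntervalSet code):

    final class IntervalSearchSketch {
        // starts[i]..ends[i] are sorted, disjoint, inclusive intervals.
        static boolean contains(int[] starts, int[] ends, int el) {
            int lo = 0;
            int hi = starts.length - 1;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;
                if (el < starts[mid])      hi = mid - 1;  // el is left of this interval
                else if (el > ends[mid])   lo = mid + 1;  // el is right of this interval
                else                       return true;   // starts[mid] <= el <= ends[mid]
            }
            return false;
        }
    }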

Latest numbers:

    load_legacy_java_utf8 average time   273us size 132683b over 3500 loads of 29038 symbols from Parser.java
    load_legacy_java_utf8 average time   293us size 128847b over 3500 loads of 13379 symbols from udhr_hin.txt
            load_new_utf8 average time   262us size 115666b over 3500 loads of 29038 symbols from Parser.java
            load_new_utf8 average time   313us size 113759b over 3500 loads of 13379 symbols from udhr_hin.txt

     lex_legacy_java_utf8 average time  1049us over 2000 runs of 29038 symbols
     lex_legacy_java_utf8 average time  1957us over 2000 runs of 29038 symbols DFA cleared
        lex_new_java_utf8 average time   785us over 2000 runs of 29038 symbols
        lex_new_java_utf8 average time  1842us over 2000 runs of 29038 symbols DFA cleared

 lex_legacy_grapheme_utf8 average time  7223us over  400 runs of  6614 symbols from udhr_kor.txt
 lex_legacy_grapheme_utf8 average time  7226us over  400 runs of  6614 symbols from udhr_kor.txt DFA cleared
 lex_legacy_grapheme_utf8 average time 13136us over  400 runs of 13379 symbols from udhr_hin.txt
 lex_legacy_grapheme_utf8 average time 13283us over  400 runs of 13379 symbols from udhr_hin.txt DFA cleared
    lex_new_grapheme_utf8 average time  7270us over  400 runs of  6614 symbols from udhr_kor.txt
    lex_new_grapheme_utf8 average time  7420us over  400 runs of  6614 symbols from udhr_kor.txt DFA cleared
    lex_new_grapheme_utf8 average time 13323us over  400 runs of 13379 symbols from udhr_hin.txt
    lex_new_grapheme_utf8 average time 13433us over  400 runs of 13379 symbols from udhr_hin.txt DFA cleared
    lex_new_grapheme_utf8 average time   148us over  400 runs of    85 symbols from emoji.txt
    lex_new_grapheme_utf8 average time   150us over  400 runs of    85 symbols from emoji.txt DFA cleared

@bhamiltoncx bhamiltoncx force-pushed the char-streams-memory-usage branch 2 times, most recently from 8ccdc3f to d7f3481 on March 24, 2017 at 19:44
@parrt
Member

parrt commented Mar 24, 2017

My phat box gets these numbers now (with IntervalSet binary search update and with an adjustment to memory size calculation but NOT these new stream changes):

Java VM args: -Xms2G -Xmx2G 
Warming up Java compiler....
   load_legacy_java_ascii average time    44us size  54301b over 3500 loads of 29038 symbols from Parser.java
    load_legacy_java_utf8 average time    46us size  94675b over 3500 loads of 29038 symbols from Parser.java
    load_legacy_java_utf8 average time   125us size  37490b over 3500 loads of 13379 symbols from udhr_hin.txt
            load_new_utf8 average time   201us size 145893b over 3500 loads of 29038 symbols from Parser.java
            load_new_utf8 average time   202us size  57482b over 3500 loads of 13379 symbols from udhr_hin.txt

     lex_legacy_java_utf8 average time   367us over 2000 runs of 29038 symbols
     lex_legacy_java_utf8 average time   907us over 2000 runs of 29038 symbols DFA cleared
        lex_new_java_utf8 average time   434us over 2000 runs of 29038 symbols
        lex_new_java_utf8 average time   963us over 2000 runs of 29038 symbols DFA cleared

 lex_legacy_grapheme_utf8 average time  4028us over  400 runs of  6614 symbols from udhr_kor.txt
 lex_legacy_grapheme_utf8 average time  4128us over  400 runs of  6614 symbols from udhr_kor.txt DFA cleared
 lex_legacy_grapheme_utf8 average time  7661us over  400 runs of 13379 symbols from udhr_hin.txt
 lex_legacy_grapheme_utf8 average time  7581us over  400 runs of 13379 symbols from udhr_hin.txt DFA cleared
    lex_new_grapheme_utf8 average time  4026us over  400 runs of  6614 symbols from udhr_kor.txt
    lex_new_grapheme_utf8 average time  4115us over  400 runs of  6614 symbols from udhr_kor.txt DFA cleared
    lex_new_grapheme_utf8 average time  7547us over  400 runs of 13379 symbols from udhr_hin.txt
    lex_new_grapheme_utf8 average time  7613us over  400 runs of 13379 symbols from udhr_hin.txt DFA cleared
    lex_new_grapheme_utf8 average time    80us over  400 runs of    85 symbols from emoji.txt
    lex_new_grapheme_utf8 average time    88us over  400 runs of    85 symbols from emoji.txt DFA cleared

@parrt
Member

parrt commented Mar 24, 2017

Weird. Here are the numbers after bringing these changes into a test branch. The legacy stuff is noticeably slower:

Java VM args: -Xms2G -Xmx2G 
Warming up Java compiler....
   load_legacy_java_ascii average time    46us size  54315b over 3500 loads of 29038 symbols from Parser.java
    load_legacy_java_utf8 average time    48us size  59922b over 3500 loads of 29038 symbols from Parser.java
    load_legacy_java_utf8 average time   118us size  22960b over 3500 loads of 13379 symbols from udhr_hin.txt
            load_new_utf8 average time    55us size  21470b over 3500 loads of 29038 symbols from Parser.java
            load_new_utf8 average time   121us size  24466b over 3500 loads of 13379 symbols from udhr_hin.txt

     lex_legacy_java_utf8 average time   601us over 2000 runs of 29038 symbols
     lex_legacy_java_utf8 average time  1130us over 2000 runs of 29038 symbols DFA cleared
        lex_new_java_utf8 average time   637us over 2000 runs of 29038 symbols
        lex_new_java_utf8 average time  1145us over 2000 runs of 29038 symbols DFA cleared

 lex_legacy_grapheme_utf8 average time  4247us over  400 runs of  6614 symbols from udhr_kor.txt
 lex_legacy_grapheme_utf8 average time  4283us over  400 runs of  6614 symbols from udhr_kor.txt DFA cleared
 lex_legacy_grapheme_utf8 average time  7745us over  400 runs of 13379 symbols from udhr_hin.txt
 lex_legacy_grapheme_utf8 average time  7851us over  400 runs of 13379 symbols from udhr_hin.txt DFA cleared
    lex_new_grapheme_utf8 average time  4201us over  400 runs of  6614 symbols from udhr_kor.txt
    lex_new_grapheme_utf8 average time  4299us over  400 runs of  6614 symbols from udhr_kor.txt DFA cleared
    lex_new_grapheme_utf8 average time  7722us over  400 runs of 13379 symbols from udhr_hin.txt
    lex_new_grapheme_utf8 average time  7824us over  400 runs of 13379 symbols from udhr_hin.txt DFA cleared
    lex_new_grapheme_utf8 average time    81us over  400 runs of    85 symbols from emoji.txt
    lex_new_grapheme_utf8 average time    85us over  400 runs of    85 symbols from emoji.txt DFA cleared

I don't see how that is possible, given the changes in this PR don't affect the legacy streams or lexer in any way I can see. @bhamiltoncx, any ideas? Note that some of the new numbers are much better due to the new streams stuff. And (woot!) the stream size is smaller for ASCII Java code using the new streams.

@parrt parrt added this to the 4.7 milestone Mar 24, 2017
@bhamiltoncx
Contributor Author

bhamiltoncx commented Mar 28, 2017 via email

@parrt
Member

parrt commented Mar 28, 2017

Hmm... OK, it seems to bounce between 350 and 450ms with or without the new JavaLexer. This is with 30,000 runs, dropping the highest 10% of CPU-time runs. Legacy warmup is rock solid at 285ms, testing repeatedly with 30,000 runs. Something is causing this, but I'll be damned if I know what!

@bhamiltoncx
Contributor Author

So, I've determined it's not the CharStreams load, and it's not constructing the Lexer; it's actually invoking the lexer on the Java source with the new stream that angers something in the legacy stream.

@bhamiltoncx
Contributor Author

FYI, I'm seeing the behavior where the legacy lexer slows down 5-10% when we warm up the new lexer on master, so I don't think it's related to this PR.

@parrt
Member

parrt commented Mar 29, 2017

I saw the same thing yesterday. I believe I have a good hypothesis: if it ain't the data that's changing, it must be the code. If the code ain't changing, it must be the JIT optimization. We got a hint about this yesterday when consume() all of a sudden popped up as significant. I believe it is the dynamic dispatch through an interface (vs. a class hierarchy) that prevents optimization. I have an experiment to try as soon as I get in, but this theory smacks of correctness.

@parrt
Member

parrt commented Mar 29, 2017

Debugging == the scientific method of computing. :)

@bhamiltoncx
Contributor Author

Ahhh. I wouldn't be at all surprised if the JIT was optimized for a single implementation of an interface.

That would avoid the vtable dispatch in consume() if we don't warm up the new lexer, but as soon as we warm up the new lexer, we have two implementations of IntStream.consume(), so the JIT has to fall back to vtable dispatch.
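
A toy demonstration of the effect (hypothetical classes, not ANTLR code): warming the call site up with a second receiver type makes it bimorphic/megamorphic, and the previously inlined call typically gets slower.

    interface Source { int next(); }

    final class SourceA implements Source { private int i; public int next() { return i++; } }
    final class SourceB implements Source { private int i; public int next() { return i--; } }

    public class CallSiteDemo {
        // Hot interface call site: with only SourceA ever seen here, the JIT can
        // devirtualize and inline next(); a second receiver type defeats that.
        static long drain(Source s, int n) {
            long sum = 0;
            for (int k = 0; k < n; k++) sum += s.next();
            return sum;
        }

        public static void main(String[] args) {
            for (int i = 0; i < 50; i++) drain(new SourceA(), 1_000_000);
            // Uncomment to "pollute" the call site with a second implementation:
            // for (int i = 0; i < 50; i++) drain(new SourceB(), 1_000_000);
            long start = System.nanoTime();
            long sum = drain(new SourceA(), 10_000_000);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println("sum=" + sum + " time=" + ms + "ms");
        }
    }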

@bhamiltoncx
Contributor Author

Yup, that's it. Check out the last two bullets:

https://wiki.openjdk.java.net/display/HotSpot/PerformanceTechniques

Methods

  • Methods are often inlined. This increases the compiler's "horizon" of optimization.
  • Static, private, final, and/or "special" invocations are easy to inline.
  • Virtual (and interface) invocations are often demoted to "special" invocations, if the class hierarchy permits it. A dependency is registered in case further class loading spoils things.
  • Virtual (and interface) invocations with a lopsided type profile are compiled with an optimistic check in favor of the historically common type (or two types).
  • Depending on the profile, a failure of the optimistic check will either deoptimize or run through a (slow) vtable/itable call.
  • On the fast path of an optimistically typed call, inlining is common. The best case is a de facto monomorphic call which is inlined. Such calls, if back-to-back, will perform the receiver type check only once.
  • In the absence of strong profiling information, a virtual (or interface) call site will be compiled in an agnostic state, waiting for the first execution to provide a provisional monomorphic receiver. (This is called an "inline cache".)
  • An inline cache will flip to a monomorphic state at the first call, and stay in that state as long as the exact receiver type (not a subtype) is repeated every time.
  • An inline cache will flip to a "megamorphic" state if a second receiver type is encountered.
  • Megamorphic calls use assembly-coded vtable and itable stubs, patched in by the JVM. The compiler does not need to manage them.

@bhamiltoncx
Contributor Author

That explains why the JMH benchmark isn't impacted—it only ever sees one implementation of IntStream (each test is run in its own forked JVM), so the JIT optimizes everything out the wazoo.

@bhamiltoncx
Contributor Author

So, tl;dr: don't mix ANTLRInputStream and CodePointCharStream in the same Lexer and you should be fine. :)

I think this is strong evidence that we should just mark ANTLRInputStream and ANTLRFileStream as @Deprecated and then remove them in 4.8.

@parrt
Member

parrt commented Mar 29, 2017

Victory! Yep, we mark as deprecated and use the new interface. Let me scour the interface today so we can move forward with the 4.7 release :) I was going to bust out https://github.com/AdoptOpenJDK/jitwatch but maybe I can "prove" this with some code duplication... maybe make a dummy common root for ANTLRInputStream and CodePointCharStream?

@parrt
Member

parrt commented Mar 29, 2017

Confirmed. If you run just one test, lex_new_java_utf8, without warmup, it's slightly faster than the old stream. If you add legacy warmup, the new stream test is slow again. I think that pretty much proves it. Using both causes megamorphic calls, preventing inlining and such. I should write up this debugging path for the students!

@bhamiltoncx
Contributor Author

Yep, here's conclusive proof:

bhamiltoncx@9f16557

I made a new pass-through second implementation of CharStream which delegates to ANTLRInputStream, and if I do even a single warmup run to pass the new CharStream implementation into JavaLexer, perf drops by 5-10%.

No new code needed. Jeez, the JIT is complicated. One more argument in favor of JMH...

@parrt
Member

parrt commented Mar 29, 2017

Well, sorta ;) A microbenchmark in isolation totally missed this issue, as it's not a realistic environment for testing in-practice stuff. Someone could erroneously report benchmarks showing ANTLR 25% slower. Sam and I clawed our way to the current speed, so it was potentially a problem. @sharwell was the one who asked me to run timing tests on the new stuff to verify we didn't erode our win. A wider test such as the one I handrolled caught something that forces my hand on deprecating the old shit in favor of your awesome new streams!

@bhamiltoncx
Contributor Author

True that!

I think it's also an argument in favor of aggressively making classes final, using lots of static methods, and avoiding "just-in-case" protected members and methods — it's pretty clear the JIT is optimized for single implementations of interfaces/classes.

@parrt
Member

parrt commented Mar 29, 2017

Re new interface, what happens when someone sends in charset utf-16 or utf-32?

@bhamiltoncx
Contributor Author

Re new interface, what happens when someone sends in charset utf-16 or utf-32?

Functionality will work fine; I'll add a test. (Or do you mean perf-wise?)
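
For example, a usage sketch of passing an explicit charset (the path and charset here are just placeholders):

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Paths;
    import org.antlr.v4.runtime.CharStream;
    import org.antlr.v4.runtime.CharStreams;

    public class ExplicitCharsetExample {
        public static void main(String[] args) throws Exception {
            // Any charset the JDK can decode gets converted to code points internally.
            CharStream stream = CharStreams.fromPath(Paths.get("input.txt"), StandardCharsets.UTF_16);
            System.out.println("loaded " + stream.size() + " code points");
        }
    }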

@parrt
Member

parrt commented Mar 29, 2017

Just responding to comment "For other sources, only supports Unicode code points up to U+FFFF." :)

@parrt
Member

parrt commented Mar 29, 2017

I'm pulling this in so we can tweak comments etc. and move forward. Could you freshen the interfaces in the other languages to be as consistent as possible? Thanks!

@parrt parrt merged commit 9764ff7 into antlr:master Mar 29, 2017
@bhamiltoncx
Contributor Author

Just responding to comment "For other sources, only supports Unicode code points up to U+FFFF." :)

Oops, that comment is definitely out of date! Thanks, I'll follow up with a PR to add a test and fix the comments.

@parrt
Member

parrt commented Mar 29, 2017

Do a victory lap, Ben!!!

@bhamiltoncx
Contributor Author

Victory Lap

@bhamiltoncx
Contributor Author

BTW, from a Java expert at FB, we might want to look at JMH's BULK warmup mode to simulate scenarios like this:

For microbenchmarking, JMH is the way to go. The perfasm profiler is particularly interesting. It will dump the generated assembly and annotate it with the performance trace from linux perf. For the case you described above, you want to make sure to run with BULK warmup mode to get more realistic (polluted) callsite profiling stats.
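
Roughly, that would be configured through the JMH runner options like this (a sketch; the benchmark include pattern is hypothetical):

    import org.openjdk.jmh.runner.Runner;
    import org.openjdk.jmh.runner.options.Options;
    import org.openjdk.jmh.runner.options.OptionsBuilder;
    import org.openjdk.jmh.runner.options.WarmupMode;

    public class BulkWarmupRunner {
        public static void main(String[] args) throws Exception {
            // BULK warmup runs every benchmark's warmup before any measurement,
            // so hot call sites see all receiver types ("polluted" profiles) up front.
            Options opt = new OptionsBuilder()
                    .include("CharStreamBench.*")     // hypothetical benchmark class
                    .warmupMode(WarmupMode.BULK)
                    .build();
            new Runner(opt).run();
        }
    }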

@bhamiltoncx bhamiltoncx deleted the char-streams-memory-usage branch March 29, 2017 17:24
@sharwell
Member

sharwell commented Mar 30, 2017

@bhamiltoncx What was the result of using ANTLRUnicodeInputStream (sharwell/antlr4@51dbd231) instead of a whole new stream infrastructure? One optimization that Java seems to like that I didn't do in this file so far is splitting LA(int) into a short fast path that can be inlined, with the longer slow path after the initial condition moved to a separate method. This stream is optimized for code points no greater than U+FFFF, but the lexer itself is optimized for code points no greater than U+007F so that doesn't seem like a particular problem or mismatch.
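
The split being described looks roughly like this (a generic sketch, not code from either stream implementation): keep LA(int) tiny so it inlines, and push everything rare into a separate method.

    final class FastSlowLaSketch {
        private final char[] data;
        private int p;   // index of the current symbol

        FastSlowLaSketch(char[] data) { this.data = data; }

        // Tiny body: cheap for the JIT to inline at every call site.
        int LA(int i) {
            if (i == 1 && p < data.length) {
                return data[p];
            }
            return laSlowPath(i);   // rare cases live out of line
        }

        private int laSlowPath(int i) {
            if (i == 0) throw new IllegalArgumentException("LA(0) is undefined");
            int index = (i > 0) ? p + i - 1 : p + i;
            if (index < 0 || index >= data.length) return -1;   // IntStream.EOF
            return data[index];
        }
    }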

@bhamiltoncx
Contributor Author

bhamiltoncx commented Mar 30, 2017 via email

@bhamiltoncx
Contributor Author

Here is an unbelievably amazing write-up of exactly the processes behind the JIT optimizer that explains what we ran into here. Great reading!

https://shipilev.net/blog/2015/black-magic-method-dispatch/
