
Improve memory usage and perf of CodePointCharStream: Use 8-bit, 16-bit, or 32-bit buffer #1781

Merged
1 commit merged into antlr:master on Mar 29, 2017

Conversation

bhamiltoncx
Contributor

@bhamiltoncx bhamiltoncx commented Mar 22, 2017

This greatly improves the memory usage and performance of CodePointCharStream by ensuring the internal storage uses either an 8-bit buffer (for Unicode code points <= U+00FF), a 16-bit buffer (for Unicode code points <= U+FFFF), or a 32-bit buffer (for Unicode code points > U+FFFF).

I split the internal storage out into a class, CodePointBuffer, whose nested CodePointBuffer.Builder class contains the logic to upgrade from 8-bit to 16-bit to 32-bit storage.

I found that the perf hotspot in CodePointCharStream on master was the virtual method calls from CharStream.LA(offset) into IntBuffer.

Refactoring it into CodePointBuffer didn't help (in fact, it added another virtual method call).

To fix the perf, I made CodePointCharStream an abstract class and added three concrete subclasses (CodePoint8BitCharStream, CodePoint16BitCharStream, and CodePoint32BitCharStream) that directly access the underlying code point array in the CodePointBuffer without virtual method calls.
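
For reference, the storage-upgrade idea can be sketched like this (a minimal sketch with a fixed capacity and illustrative names, not the actual CodePointBuffer.Builder code):

    // Minimal sketch of the 8/16/32-bit upgrade idea; names and structure are illustrative.
    final class CodePointStorageSketch {
        private byte[] bytes;   // active while every code point fits in 8 bits
        private char[] chars;   // active once some code point needs 16 bits
        private int[]  ints;    // active once some code point needs more than 16 bits
        private int size;

        CodePointStorageSketch(int capacity) { bytes = new byte[capacity]; }

        void append(int codePoint) {
            if (ints == null && codePoint > 0xFFFF) upgradeToInts();
            else if (ints == null && chars == null && codePoint > 0xFF) upgradeToChars();

            if (ints != null)       ints[size++]  = codePoint;
            else if (chars != null) chars[size++] = (char) codePoint;
            else                    bytes[size++] = (byte) codePoint;
        }

        int codePointAt(int i) {
            if (ints != null)  return ints[i];
            if (chars != null) return chars[i];     // char values are already unsigned
            return bytes[i] & 0xFF;                 // mask so the byte reads as unsigned
        }

        private void upgradeToChars() {
            chars = new char[bytes.length];
            for (int i = 0; i < size; i++) chars[i] = (char) (bytes[i] & 0xFF);
            bytes = null;
        }

        private void upgradeToInts() {
            int capacity = (chars != null) ? chars.length : bytes.length;
            ints = new int[capacity];
            for (int i = 0; i < size; i++) {
                ints[i] = (chars != null) ? chars[i] : (bytes[i] & 0xFF);
            }
            chars = null;
            bytes = null;
        }
    }

The concrete 8-, 16-, and 32-bit CharStream subclasses then index the matching array directly, so the hot LA() path becomes an array read plus a mask instead of a virtual call.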

@parrt
Member

parrt commented Mar 23, 2017

Cool. Does the new lexer timing stuff say anything interesting about the cost of the new streams? I should probably add a memory size check to those tests... Java makes it hard but maybe there's a new API I don't know about.

@@ -3,17 +3,13 @@
* Use of this file is governed by the BSD 3-clause license that
* can be found in the LICENSE.txt file in the project root.
*/
-package org.antlr.v4.test.runtime.java;
+package org.antlr.v4.runtime;
Member

Should the test code really be in the runtime instead of the test area?

Contributor Author

Good spot! I did this on purpose so I could keep the API CodePointBuffer.getType() package-private. (That's a common pattern: it allows tests, but not the public, to access certain APIs.)
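
(For anyone unfamiliar with the pattern, here's a generic illustration with made-up names; the test class shares the production class's package but lives under the test source root, so it can reach package-private members without widening the public API:)

    // Hypothetical example of the package-private testing pattern.
    // Both classes are in the same Java package; the test lives under src/test/java.
    package com.example.buffer;

    public class WidthBuffer {
        // package-private: callable from same-package tests, invisible to library users
        int internalWidth() { return 8; }
    }

    class WidthBufferTest {
        void checkWidth() {
            if (new WidthBuffer().internalWidth() != 8) throw new AssertionError();
        }
    }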

Member

OK, we should definitely keep it in the test area so it does not get pushed out as part of the runtime jar. Can we separate it somehow?

Contributor Author

Ah, I see the confusion. I didn't move which Maven target it was in, so it won't show up in the runtime jar. I just put it in the same Java package so it could access package-private methods.

Member

Ah, OK. That will be a bit weird in the dir tree layout, but no biggie.

@bhamiltoncx
Contributor Author

Does the new lexer timing stuff say anything interesting about the cost of the new streams?

I'm trying to run the new test, but it looks like it passes a URL where we expect a Path:

% java -cp /data/users/beng/src/antlr4/runtime/Java/target/antlr4-runtime-4.7-SNAPSHOT.jar:/data/users/beng/src/antlr4/runtime-testsuite/target/antlr4-runtime-testsuite-4.7-SNAPSHOT.jar:/data/users/beng/src/antlr4/runtime-testsuite/target/antlr4-runtime-testsuite-4.7-SNAPSHOT-tests.jar org.antlr.v4.test.runtime.java.api.perf.TimeLexerSpeed      
Warming up Java compilerException in thread "main" java.nio.file.NoSuchFileException: file:/data/users/beng/src/antlr4/runtime-testsuite/target/antlr4-runtime-testsuite-4.7-SNAPSHOT.jar!/Java/src/org/antlr/v4/runtime/Parser.java
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
        at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
        at java.nio.file.Files.newByteChannel(Files.java:361)
        at java.nio.file.Files.newByteChannel(Files.java:407)
        at org.antlr.v4.runtime.CharStreams.fromPath(CharStreams.java:59)
        at org.antlr.v4.runtime.CharStreams.fromPath(CharStreams.java:43)
        at org.antlr.v4.test.runtime.java.api.perf.TimeLexerSpeed.lex_new_java_utf8(TimeLexerSpeed.java:208)
        at org.antlr.v4.test.runtime.java.api.perf.TimeLexerSpeed.compilerWarmUp(TimeLexerSpeed.java:110)
        at org.antlr.v4.test.runtime.java.api.perf.TimeLexerSpeed.main(TimeLexerSpeed.java:71)

I'll try to fix it.
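
For what it's worth, a sketch of the kind of workaround that avoids resolving a jar resource to a filesystem Path (not necessarily the exact fix in #1787) is to read the classpath resource as a stream:

    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;
    import org.antlr.v4.runtime.CharStream;
    import org.antlr.v4.runtime.CharStreams;

    public class LoadFromClasspath {
        // Paths.get(url.toURI()) fails when the resource lives inside a jar;
        // reading the resource as a stream works for both directories and jars.
        public static CharStream load(String resourceName) throws Exception {
            ClassLoader loader = LoadFromClasspath.class.getClassLoader();
            try (InputStream in = loader.getResourceAsStream(resourceName)) {
                if (in == null) throw new IllegalArgumentException("resource not found: " + resourceName);
                return CharStreams.fromStream(in, StandardCharsets.UTF_8);
            }
        }
    }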

@parrt
Member

parrt commented Mar 23, 2017

OK, thanks. Yeah, I've been running from the development environment, not from the jar. I should've known there would be an issue finding the files to test...

@bhamiltoncx
Contributor Author

Fixed the test: #1787

Ran the test on Linux with an Intel Xeon E5-2600 @ 2.20 GHz.

Before:

Warming up Java compiler....                                                                                                    
    load_legacy_java_utf8 average time   248us over 3500 loads of 29038 symbols from Parser.java                                
    load_legacy_java_utf8 average time   301us over 3500 loads of 13379 symbols from udhr_hin.txt                               
            load_new_utf8 average time   535us over 3500 loads of 29038 symbols from Parser.java                                
            load_new_utf8 average time   420us over 3500 loads of 13379 symbols from udhr_hin.txt                               
                                                                                                                                
     lex_legacy_java_utf8 average time   625us over 2000 runs of 29038 symbols                                                  
     lex_legacy_java_utf8 average time  1678us over 2000 runs of 29038 symbols DFA cleared                                      
        lex_new_java_utf8 average time   644us over 2000 runs of 29038 symbols                                                  
        lex_new_java_utf8 average time  1534us over 2000 runs of 29038 symbols DFA cleared                                      
                                                                                                                                
 lex_legacy_grapheme_utf8 average time 12950us over  400 runs of  6614 symbols from udhr_kor.txt                                
 lex_legacy_grapheme_utf8 average time 11953us over  400 runs of  6614 symbols from udhr_kor.txt DFA cleared                    
 lex_legacy_grapheme_utf8 average time 10144us over  400 runs of 13379 symbols from udhr_hin.txt                                
 lex_legacy_grapheme_utf8 average time 11146us over  400 runs of 13379 symbols from udhr_hin.txt DFA cleared                    
    lex_new_grapheme_utf8 average time 11914us over  400 runs of  6614 symbols from udhr_kor.txt                                
    lex_new_grapheme_utf8 average time 12030us over  400 runs of  6614 symbols from udhr_kor.txt DFA cleared                    
    lex_new_grapheme_utf8 average time 10400us over  400 runs of 13379 symbols from udhr_hin.txt                                
    lex_new_grapheme_utf8 average time 10365us over  400 runs of 13379 symbols from udhr_hin.txt DFA cleared                    
    lex_new_grapheme_utf8 average time   183us over  400 runs of    85 symbols from emoji.txt                                   
    lex_new_grapheme_utf8 average time   192us over  400 runs of    85 symbols from emoji.txt DFA cleared        

After:

Warming up Java compiler....
    load_legacy_java_utf8 average time   247us over 3500 loads of 29038 symbols from Parser.java
    load_legacy_java_utf8 average time   294us over 3500 loads of 13379 symbols from udhr_hin.txt
            load_new_utf8 average time   754us over 3500 loads of 29038 symbols from Parser.java
            load_new_utf8 average time   573us over 3500 loads of 13379 symbols from udhr_hin.txt

     lex_legacy_java_utf8 average time   661us over 2000 runs of 29038 symbols
     lex_legacy_java_utf8 average time  1753us over 2000 runs of 29038 symbols DFA cleared
        lex_new_java_utf8 average time  1313us over 2000 runs of 29038 symbols
        lex_new_java_utf8 average time  2217us over 2000 runs of 29038 symbols DFA cleared

 lex_legacy_grapheme_utf8 average time 11915us over  400 runs of  6614 symbols from udhr_kor.txt
 lex_legacy_grapheme_utf8 average time 12356us over  400 runs of  6614 symbols from udhr_kor.txt DFA cleared
 lex_legacy_grapheme_utf8 average time 10346us over  400 runs of 13379 symbols from udhr_hin.txt
 lex_legacy_grapheme_utf8 average time 10506us over  400 runs of 13379 symbols from udhr_hin.txt DFA cleared
    lex_new_grapheme_utf8 average time 12087us over  400 runs of  6614 symbols from udhr_kor.txt
    lex_new_grapheme_utf8 average time 12174us over  400 runs of  6614 symbols from udhr_kor.txt DFA cleared
    lex_new_grapheme_utf8 average time 10762us over  400 runs of 13379 symbols from udhr_hin.txt
    lex_new_grapheme_utf8 average time 10873us over  400 runs of 13379 symbols from udhr_hin.txt DFA cleared
    lex_new_grapheme_utf8 average time   194us over  400 runs of    85 symbols from emoji.txt
    lex_new_grapheme_utf8 average time   204us over  400 runs of    85 symbols from emoji.txt DFA cleared

Now that I have the perf test running cleanly in my environment, I'll attach YourKit and see what I can fix.

@bhamiltoncx
Contributor Author

OK, with #1788 on top of this:

    load_legacy_java_utf8 average time   272us size 132788b over 3500 loads of 29038 symbols from Parser.java
    load_legacy_java_utf8 average time   305us size 128671b over 3500 loads of 13379 symbols from udhr_hin.txt
            load_new_utf8 average time   745us size 116765b over 3500 loads of 29038 symbols from Parser.java
            load_new_utf8 average time   539us size 119317b over 3500 loads of 13379 symbols from udhr_hin.txt

I know how to improve the perf a few ways as well.

@parrt
Member

parrt commented Mar 23, 2017

Hmm... load times seem slow (you've got an SSD and a fast clock?). Also, the sizes are bigger, not smaller, for Parser.java.

@bhamiltoncx
Contributor Author

bhamiltoncx commented Mar 23, 2017

Yep, I profiled the load times and I see the bottleneck.

Java's native UTF-8 decoder takes advantage of the raw array interface of ByteBuffer to avoid lots of virtual function calls to access the bytes inside:

http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/sun/nio/cs/UTF_8.java#UTF_8.Decoder.decodeArrayLoop%28java.nio.ByteBuffer%2Cjava.nio.CharBuffer%29

I'll do the same.
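
The general shape of that array fast path, as a hedged sketch of the technique (not the JDK's actual decoder):

    import java.nio.ByteBuffer;

    final class AsciiFastPathSketch {
        // Copies the ASCII prefix of a buffer by indexing its backing array directly,
        // avoiding a virtual ByteBuffer.get() call per byte. Illustrative only.
        static int copyAsciiPrefix(ByteBuffer src, char[] dst) {
            int n = 0;
            if (src.hasArray()) {
                byte[] arr = src.array();
                int pos = src.arrayOffset() + src.position();
                int limit = src.arrayOffset() + src.limit();
                while (pos < limit && n < dst.length && arr[pos] >= 0) {
                    dst[n++] = (char) arr[pos++];
                }
                src.position(pos - src.arrayOffset());   // record how far we consumed
            }
            return n;
        }
    }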

How did you come to the conclusion that the sizes are bigger and not smaller for Parser.java? Did I misread the output?

@parrt
Member

parrt commented Mar 23, 2017

My run shows:

     load_legacy_java_utf8 average time    61us size  64788b over 3500 loads of 29038 symbols from Parser.java
     load_legacy_java_utf8 average time   122us size 102088b over 3500 loads of 13379 symbols from udhr_hin.txt
             load_new_utf8 average time   212us size 176187b over 3500 loads of 29038 symbols from Parser.java
             load_new_utf8 average time   206us size  49439b over 3500 loads of 13379 symbols from udhr_hin.txt

but you are showing bigger sizes (132,788 > 64,788 for Parser.java, 128,671 > 102,088 for udhr_hin.txt), though your new UTF-8 size is smaller than your legacy size for Parser.java:

    load_legacy_java_utf8 average time   272us size 132788b over 3500 loads of 29038 symbols from Parser.java
    load_legacy_java_utf8 average time   305us size 128671b over 3500 loads of 13379 symbols from udhr_hin.txt
            load_new_utf8 average time   745us size 116765b over 3500 loads of 29038 symbols from Parser.java
            load_new_utf8 average time   539us size 119317b over 3500 loads of 13379 symbols from udhr_hin.txt

I was hoping to see Parser.java be 50% size of 64,788 for legacy load.

@bhamiltoncx
Contributor Author

bhamiltoncx commented Mar 23, 2017 via email

@parrt
Member

parrt commented Mar 23, 2017

Ah. Damn, Java makes it hard to profile this stuff. Hmm... OK, I'll take a look later.

@parrt
Member

parrt commented Mar 23, 2017

Actually, shouldn't this avoid ClassLoader related memory requirements?

	public void load_legacy_java_ascii(int n) throws Exception {
		URL sampleJavaFile = TimeLexerSpeed.class.getClassLoader().getResource(Parser_java_file);
		long start = System.nanoTime();
		CharStream[] input = new CharStream[n]; // keep refs around so we can average memory
		System.gc();
		long beforeFreeMem = Runtime.getRuntime().freeMemory();

@parrt
Member

parrt commented Mar 23, 2017

Oh, never mind, I see the diff in our code. Yeah, we should try to isolate the buffer size if we can.

@bhamiltoncx
Contributor Author

With a bit o' wrangling:

Warming up Java compiler....
    load_legacy_java_utf8 average time   270us size 132216b over 3500 loads of 29038 symbols from Parser.java
    load_legacy_java_utf8 average time   290us size 127886b over 3500 loads of 13379 symbols from udhr_hin.txt
            load_new_utf8 average time   253us size 114134b over 3500 loads of 29038 symbols from Parser.java
            load_new_utf8 average time   310us size 112967b over 3500 loads of 13379 symbols from udhr_hin.txt

     lex_legacy_java_utf8 average time   804us over 2000 runs of 29038 symbols
     lex_legacy_java_utf8 average time  1804us over 2000 runs of 29038 symbols DFA cleared
        lex_new_java_utf8 average time  1308us over 2000 runs of 29038 symbols
        lex_new_java_utf8 average time  2418us over 2000 runs of 29038 symbols DFA cleared

 lex_legacy_grapheme_utf8 average time 11883us over  400 runs of  6614 symbols from udhr_kor.txt
 lex_legacy_grapheme_utf8 average time 12017us over  400 runs of  6614 symbols from udhr_kor.txt DFA cleared
 lex_legacy_grapheme_utf8 average time 10456us over  400 runs of 13379 symbols from udhr_hin.txt
 lex_legacy_grapheme_utf8 average time 10347us over  400 runs of 13379 symbols from udhr_hin.txt DFA cleared
    lex_new_grapheme_utf8 average time 12161us over  400 runs of  6614 symbols from udhr_kor.txt
    lex_new_grapheme_utf8 average time 12350us over  400 runs of  6614 symbols from udhr_kor.txt DFA cleared
    lex_new_grapheme_utf8 average time 11154us over  400 runs of 13379 symbols from udhr_hin.txt
    lex_new_grapheme_utf8 average time 11117us over  400 runs of 13379 symbols from udhr_hin.txt DFA cleared
    lex_new_grapheme_utf8 average time   213us over  400 runs of    85 symbols from emoji.txt
    lex_new_grapheme_utf8 average time   202us over  400 runs of    85 symbols from emoji.txt DFA cleared

@bhamiltoncx
Contributor Author

bhamiltoncx commented Mar 23, 2017

OK, I ended up removing my hand-written UTF8CodePointDecoder completely and just relying on the built-in JDK CharsetDecoder to convert from arbitrary encodings to UTF-16, then converting UTF-16 to Unicode code points.
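
In outline, that two-step decode looks something like this (a sketch with hypothetical names; the real code works buffer-to-buffer rather than through a String):

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.CodingErrorAction;

    final class DecodeSketch {
        // Step 1: arbitrary charset -> UTF-16 via the JDK's CharsetDecoder.
        // Step 2: UTF-16 -> Unicode code points.
        static int[] toCodePoints(byte[] bytes, Charset charset) throws Exception {
            CharBuffer utf16 = charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE)
                    .decode(ByteBuffer.wrap(bytes));
            return utf16.toString().codePoints().toArray();
        }
    }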

Memory usage is still better than before, but I'll profile more to see what I can prune.

BTW, #1788 is included inside this PR now. I'll rebase whenever #1788 lands.

@bhamiltoncx
Contributor Author

YourKit profiling is a wonderful thing!

I found that the perf hotspot in CodePointCharStream on master was the virtual method calls from CharStream.LA(offset) into IntBuffer.

Refactoring it into CodePointBuffer didn't help (in fact, it added another virtual method call).

To fix the perf, I made CodePointCharStream an abstract class and added three concrete subclasses (CodePoint8BitCharStream, CodePoint16BitCharStream, and CodePoint32BitCharStream) that directly access the underlying code point array in the CodePointBuffer without virtual method calls.

Et voilà, perf is in line with or slightly better than the original ANTLRInputStream and memory usage is better as well:

Warming up Java compiler....
    load_legacy_java_utf8 average time   275us size 132669b over 3500 loads of 29038 symbols from Parser.java
    load_legacy_java_utf8 average time   292us size 128806b over 3500 loads of 13379 symbols from udhr_hin.txt
            load_new_utf8 average time   266us size 114950b over 3500 loads of 29038 symbols from Parser.java
            load_new_utf8 average time   311us size 113672b over 3500 loads of 13379 symbols from udhr_hin.txt

     lex_legacy_java_utf8 average time  1065us over 2000 runs of 29038 symbols
     lex_legacy_java_utf8 average time  1971us over 2000 runs of 29038 symbols DFA cleared
        lex_new_java_utf8 average time  1026us over 2000 runs of 29038 symbols
        lex_new_java_utf8 average time  1960us over 2000 runs of 29038 symbols DFA cleared

 lex_legacy_grapheme_utf8 average time 12545us over  400 runs of  6614 symbols from udhr_kor.txt
 lex_legacy_grapheme_utf8 average time 12611us over  400 runs of  6614 symbols from udhr_kor.txt DFA cleared
 lex_legacy_grapheme_utf8 average time 11303us over  400 runs of 13379 symbols from udhr_hin.txt
 lex_legacy_grapheme_utf8 average time 11195us over  400 runs of 13379 symbols from udhr_hin.txt DFA cleared
    lex_new_grapheme_utf8 average time 12532us over  400 runs of  6614 symbols from udhr_kor.txt
    lex_new_grapheme_utf8 average time 12387us over  400 runs of  6614 symbols from udhr_kor.txt DFA cleared
    lex_new_grapheme_utf8 average time 10856us over  400 runs of 13379 symbols from udhr_hin.txt
    lex_new_grapheme_utf8 average time 11043us over  400 runs of 13379 symbols from udhr_hin.txt DFA cleared
    lex_new_grapheme_utf8 average time   187us over  400 runs of    85 symbols from emoji.txt
    lex_new_grapheme_utf8 average time   197us over  400 runs of    85 symbols from emoji.txt DFA cleared

@bhamiltoncx
Contributor Author

(I sure wish Java had proper generics over primitive types... This whole type erasure thing was definitely not the right call!)

@bhamiltoncx bhamiltoncx changed the title Improve memory usage of CodePointCharStream: Use 8-bit, 16-bit, or 32-bit buffer Improve memory usage and perf of CodePointCharStream: Use 8-bit, 16-bit, or 32-bit buffer Mar 24, 2017
@ericvergnaud
Contributor

ericvergnaud commented Mar 24, 2017 via email

@bhamiltoncx
Contributor Author

bhamiltoncx commented Mar 24, 2017

Found a nice ~50% speed boost in the lexer by making IntervalSet.contains(el) use binary search instead of a linear scan.

Sent out a separate PR for that: #1789
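
The idea behind that change, sketched on a sorted array of disjoint, inclusive intervals (illustrative, not the exact IntervalSet code):

    final class IntervalSearchSketch {
        // starts[i]..ends[i] are sorted, disjoint, inclusive intervals.
        static boolean contains(int[] starts, int[] ends, int el) {
            int lo = 0;
            int hi = starts.length - 1;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;
                if (el < starts[mid])      hi = mid - 1;  // el is left of this interval
                else if (el > ends[mid])   lo = mid + 1;  // el is right of this interval
                else                       return true;   // starts[mid] <= el <= ends[mid]
            }
            return false;
        }
    }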

Latest numbers:

    load_legacy_java_utf8 average time   273us size 132683b over 3500 loads of 29038 symbols from Parser.java
    load_legacy_java_utf8 average time   293us size 128847b over 3500 loads of 13379 symbols from udhr_hin.txt
            load_new_utf8 average time   262us size 115666b over 3500 loads of 29038 symbols from Parser.java
            load_new_utf8 average time   313us size 113759b over 3500 loads of 13379 symbols from udhr_hin.txt

     lex_legacy_java_utf8 average time  1049us over 2000 runs of 29038 symbols
     lex_legacy_java_utf8 average time  1957us over 2000 runs of 29038 symbols DFA cleared
        lex_new_java_utf8 average time   785us over 2000 runs of 29038 symbols
        lex_new_java_utf8 average time  1842us over 2000 runs of 29038 symbols DFA cleared

 lex_legacy_grapheme_utf8 average time  7223us over  400 runs of  6614 symbols from udhr_kor.txt
 lex_legacy_grapheme_utf8 average time  7226us over  400 runs of  6614 symbols from udhr_kor.txt DFA cleared
 lex_legacy_grapheme_utf8 average time 13136us over  400 runs of 13379 symbols from udhr_hin.txt
 lex_legacy_grapheme_utf8 average time 13283us over  400 runs of 13379 symbols from udhr_hin.txt DFA cleared
    lex_new_grapheme_utf8 average time  7270us over  400 runs of  6614 symbols from udhr_kor.txt
    lex_new_grapheme_utf8 average time  7420us over  400 runs of  6614 symbols from udhr_kor.txt DFA cleared
    lex_new_grapheme_utf8 average time 13323us over  400 runs of 13379 symbols from udhr_hin.txt
    lex_new_grapheme_utf8 average time 13433us over  400 runs of 13379 symbols from udhr_hin.txt DFA cleared
    lex_new_grapheme_utf8 average time   148us over  400 runs of    85 symbols from emoji.txt
    lex_new_grapheme_utf8 average time   150us over  400 runs of    85 symbols from emoji.txt DFA cleared

@bhamiltoncx bhamiltoncx force-pushed the char-streams-memory-usage branch 2 times, most recently from 8ccdc3f to d7f3481 on March 24, 2017 at 19:44
@parrt
Member

parrt commented Mar 24, 2017

My phat box gets these numbers now (with IntervalSet binary search update and with an adjustment to memory size calculation but NOT these new stream changes):

Java VM args: -Xms2G -Xmx2G 
Warming up Java compiler....
   load_legacy_java_ascii average time    44us size  54301b over 3500 loads of 29038 symbols from Parser.java
    load_legacy_java_utf8 average time    46us size  94675b over 3500 loads of 29038 symbols from Parser.java
    load_legacy_java_utf8 average time   125us size  37490b over 3500 loads of 13379 symbols from udhr_hin.txt
            load_new_utf8 average time   201us size 145893b over 3500 loads of 29038 symbols from Parser.java
            load_new_utf8 average time   202us size  57482b over 3500 loads of 13379 symbols from udhr_hin.txt

     lex_legacy_java_utf8 average time   367us over 2000 runs of 29038 symbols
     lex_legacy_java_utf8 average time   907us over 2000 runs of 29038 symbols DFA cleared
        lex_new_java_utf8 average time   434us over 2000 runs of 29038 symbols
        lex_new_java_utf8 average time   963us over 2000 runs of 29038 symbols DFA cleared

 lex_legacy_grapheme_utf8 average time  4028us over  400 runs of  6614 symbols from udhr_kor.txt
 lex_legacy_grapheme_utf8 average time  4128us over  400 runs of  6614 symbols from udhr_kor.txt DFA cleared
 lex_legacy_grapheme_utf8 average time  7661us over  400 runs of 13379 symbols from udhr_hin.txt
 lex_legacy_grapheme_utf8 average time  7581us over  400 runs of 13379 symbols from udhr_hin.txt DFA cleared
    lex_new_grapheme_utf8 average time  4026us over  400 runs of  6614 symbols from udhr_kor.txt
    lex_new_grapheme_utf8 average time  4115us over  400 runs of  6614 symbols from udhr_kor.txt DFA cleared
    lex_new_grapheme_utf8 average time  7547us over  400 runs of 13379 symbols from udhr_hin.txt
    lex_new_grapheme_utf8 average time  7613us over  400 runs of 13379 symbols from udhr_hin.txt DFA cleared
    lex_new_grapheme_utf8 average time    80us over  400 runs of    85 symbols from emoji.txt
    lex_new_grapheme_utf8 average time    88us over  400 runs of    85 symbols from emoji.txt DFA cleared

@parrt
Member

parrt commented Mar 24, 2017

Weird. Here are the numbers after bringing these changes into a test branch. The legacy stuff is noticeably slower:

Java VM args: -Xms2G -Xmx2G 
Warming up Java compiler....
   load_legacy_java_ascii average time    46us size  54315b over 3500 loads of 29038 symbols from Parser.java
    load_legacy_java_utf8 average time    48us size  59922b over 3500 loads of 29038 symbols from Parser.java
    load_legacy_java_utf8 average time   118us size  22960b over 3500 loads of 13379 symbols from udhr_hin.txt
            load_new_utf8 average time    55us size  21470b over 3500 loads of 29038 symbols from Parser.java
            load_new_utf8 average time   121us size  24466b over 3500 loads of 13379 symbols from udhr_hin.txt

     lex_legacy_java_utf8 average time   601us over 2000 runs of 29038 symbols
     lex_legacy_java_utf8 average time  1130us over 2000 runs of 29038 symbols DFA cleared
        lex_new_java_utf8 average time   637us over 2000 runs of 29038 symbols
        lex_new_java_utf8 average time  1145us over 2000 runs of 29038 symbols DFA cleared

 lex_legacy_grapheme_utf8 average time  4247us over  400 runs of  6614 symbols from udhr_kor.txt
 lex_legacy_grapheme_utf8 average time  4283us over  400 runs of  6614 symbols from udhr_kor.txt DFA cleared
 lex_legacy_grapheme_utf8 average time  7745us over  400 runs of 13379 symbols from udhr_hin.txt
 lex_legacy_grapheme_utf8 average time  7851us over  400 runs of 13379 symbols from udhr_hin.txt DFA cleared
    lex_new_grapheme_utf8 average time  4201us over  400 runs of  6614 symbols from udhr_kor.txt
    lex_new_grapheme_utf8 average time  4299us over  400 runs of  6614 symbols from udhr_kor.txt DFA cleared
    lex_new_grapheme_utf8 average time  7722us over  400 runs of 13379 symbols from udhr_hin.txt
    lex_new_grapheme_utf8 average time  7824us over  400 runs of 13379 symbols from udhr_hin.txt DFA cleared
    lex_new_grapheme_utf8 average time    81us over  400 runs of    85 symbols from emoji.txt
    lex_new_grapheme_utf8 average time    85us over  400 runs of    85 symbols from emoji.txt DFA cleared

I don't see how that is possible, given the changes in this PR don't affect the legacy streams or lexer in any way I can see. @bhamiltoncx, any ideas? Note that some of the new numbers are much better due to the new streams stuff. And (woot!) the stream size is smaller for ASCII Java code using the new streams.

@parrt parrt added this to the 4.7 milestone Mar 24, 2017
@bhamiltoncx
Contributor Author

bhamiltoncx commented Mar 28, 2017 via email

@parrt
Member

parrt commented Mar 28, 2017

Hmm... OK, it seems to bounce between 350 and 450ms with or without the new JavaLexer. This is with 30,000 runs, dropping the highest 10% of CPU-time runs. Legacy warmup is rock solid at 285ms, testing repeatedly with 30,000 runs. Something is causing this, but I'll be damned if I know what!

@bhamiltoncx
Contributor Author

So, I've determined it's not the CharStreams load, and it's not constructing the Lexer; it's actually invoking the lexer on the Java source with the new stream that angers something in the legacy stream.

@bhamiltoncx
Contributor Author

FYI, I'm seeing the behavior where the legacy lexer slows down 5-10% when we warm up the new lexer on master, so I don't think it's related to this PR.

@parrt
Member

parrt commented Mar 29, 2017

I saw the same thing yesterday. I believe I have a good hypothesis: if it ain't the data that's changing, it must be the code. If the code ain't changing, it must be the JIT optimization. We got a hint about this yesterday when consume() all of a sudden popped up as significant. I believe it is the dynamic dispatch through an interface (vs. a class hierarchy) that prevents optimization. I have an experiment to try as soon as I get in, but this theory smacks of correctness.

@parrt
Member

parrt commented Mar 29, 2017

Debugging == the scientific method of computing. :)

@bhamiltoncx
Contributor Author

Ahhh. I wouldn't be at all surprised if the JIT was optimized for a single implementation of an interface.

That would avoid the vtable dispatch in consume() if we don't warm up the new lexer, but as soon as we warm up the new lexer, we have two implementations of IntStream.consume(), so the JIT has to fall back to vtable dispatch.
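
A toy demonstration of the effect (hypothetical classes, not ANTLR code): warming the call site up with a second receiver type makes it bimorphic/megamorphic, and the previously inlined call typically gets slower.

    interface Source { int next(); }

    final class SourceA implements Source { private int i; public int next() { return i++; } }
    final class SourceB implements Source { private int i; public int next() { return i--; } }

    public class CallSiteDemo {
        // Hot interface call site: with only SourceA ever seen here, the JIT can
        // devirtualize and inline next(); a second receiver type defeats that.
        static long drain(Source s, int n) {
            long sum = 0;
            for (int k = 0; k < n; k++) sum += s.next();
            return sum;
        }

        public static void main(String[] args) {
            for (int i = 0; i < 50; i++) drain(new SourceA(), 1_000_000);
            // Uncomment to "pollute" the call site with a second implementation:
            // for (int i = 0; i < 50; i++) drain(new SourceB(), 1_000_000);
            long start = System.nanoTime();
            long sum = drain(new SourceA(), 10_000_000);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println("sum=" + sum + " time=" + ms + "ms");
        }
    }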

@bhamiltoncx
Contributor Author

Yup, that's it. Check out the last two bullets:

https://wiki.openjdk.java.net/display/HotSpot/PerformanceTechniques

Methods

  • Methods are often inlined. This increases the compiler's "horizon" of optimization.
  • Static, private, final, and/or "special" invocations are easy to inline.
  • Virtual (and interface) invocations are often demoted to "special" invocations, if the class hierarchy permits it. A dependency is registered in case further class loading spoils things.
  • Virtual (and interface) invocations with a lopsided type profile are compiled with an optimistic check in favor of the historically common type (or two types).
  • Depending on the profile, a failure of the optimistic check will either deoptimize or run through a (slow) vtable/itable call.
  • On the fast path of an optimistically typed call, inlining is common. The best case is a de facto monomorphic call which is inlined. Such calls, if back-to-back, will perform the receiver type check only once.
  • In the absence of strong profiling information, a virtual (or interface) call site will be compiled in an agnostic state, waiting for the first execution to provide a provisional monomorphic receiver. (This is called an "inline cache".)
  • An inline cache will flip to a monomorphic state at the first call, and stay in that state as long as the exact receiver type (not a subtype) is repeated every time.
  • An inline cache will flip to a "megamorphic" state if a second receiver type is encountered.
  • Megamorphic calls use assembly-coded vtable and itable stubs, patched in by the JVM. The compiler does not need to manage them.

@bhamiltoncx
Contributor Author

That explains why the JMH benchmark isn't impacted—it only ever sees one implementation of IntStream (each test is run in its own forked JVM), so the JIT optimizes everything out the wazoo.

@bhamiltoncx
Contributor Author

So, tl;dr: don't mix ANTLRInputStream and CodePointCharStream in the same Lexer and you should be fine. :)

I think this is strong evidence that we should just mark ANTLRInputStream and ANTLRFileStream as @Deprecated and then remove them in 4.8.

@parrt
Member

parrt commented Mar 29, 2017

Victory! Yep, we mark as deprecated and use the new interface. Let me scour the interface today so we can move forward with the 4.7 release :) I was going to bust out https://github.com/AdoptOpenJDK/jitwatch but maybe I can "prove" this with some code duplication... maybe make a dummy common root for ANTLRInputStream and CodePointCharStream?

@parrt
Member

parrt commented Mar 29, 2017

Confirmed. If you run just one test, lex_new_java_utf8, without warmup, it's slightly faster than the old stream. If you add legacy warmup, the new stream test is slow again. I think that pretty much proves it. Using both causes megamorphic calls, preventing inlining and such. I should write up this debugging path for the students!

@bhamiltoncx
Contributor Author

Yep, here's conclusive proof:

bhamiltoncx@9f16557

I made a new pass-through second implementation of CharStream which delegates to ANTLRInputStream, and if I do even a single warmup run to pass the new CharStream implementation into JavaLexer, perf drops by 5-10%.

No new code needed. Jeez, the JIT is complicated. One more argument in favor of JMH...

@parrt
Member

parrt commented Mar 29, 2017

Well, sorta ;) A microbenchmark in isolation totally missed this issue, as it's not a realistic environment for testing in-practice stuff. Someone could erroneously report benchmarks showing ANTLR 25% slower. Sam and I clawed our way to the current speed, so it was potentially a problem. @sharwell was the one who asked me to run timing tests on the new stuff to verify we didn't erode our win. A wider test such as the one I handrolled caught something that forces my hand on deprecating the old shit in favor of your awesome new streams!

@bhamiltoncx
Contributor Author

True that!

I think it's also an argument in favor of aggressively making classes final, using lots of static methods, and avoiding "just-in-case" protected members and methods — it's pretty clear the JIT is optimized for single implementations of interfaces/classes.

@parrt
Member

parrt commented Mar 29, 2017

Re new interface, what happens when someone sends in charset utf-16 or utf-32?

@bhamiltoncx
Contributor Author

Re new interface, what happens when someone sends in charset utf-16 or utf-32?

Functionality will work fine; I'll add a test. (Or do you mean perf-wise?)
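
For example, a usage sketch of passing an explicit charset (the path and charset here are just placeholders):

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Paths;
    import org.antlr.v4.runtime.CharStream;
    import org.antlr.v4.runtime.CharStreams;

    public class ExplicitCharsetExample {
        public static void main(String[] args) throws Exception {
            // Any charset the JDK can decode gets converted to code points internally.
            CharStream stream = CharStreams.fromPath(Paths.get("input.txt"), StandardCharsets.UTF_16);
            System.out.println("loaded " + stream.size() + " code points");
        }
    }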

@parrt
Member

parrt commented Mar 29, 2017

Just responding to comment "For other sources, only supports Unicode code points up to U+FFFF." :)

@parrt
Member

parrt commented Mar 29, 2017

I'm pulling this in so we can tweak comments etc. and move forward. Could you freshen the interfaces in the other languages to be as consistent as possible? Thanks!

@parrt parrt merged commit 9764ff7 into antlr:master Mar 29, 2017
@bhamiltoncx
Contributor Author

Just responding to comment "For other sources, only supports Unicode code points up to U+FFFF." :)

Oops, that comment is definitely out of date! Thanks, I'll follow up with a PR to add a test and fix the comments.

@parrt
Member

parrt commented Mar 29, 2017

Do a victory lap, Ben!!!

@bhamiltoncx
Contributor Author

Victory Lap

@bhamiltoncx
Contributor Author

BTW, from a Java expert at FB, we might want to look at JMH's BULK warmup mode to simulate scenarios like this:

For microbenchmarking, JMH is the way to go. The perfasm profiler is particularly interesting. It will dump the generated assembly and annotate it with the performance trace from linux perf. For the case you described above, you want to make sure to run with BULK warmup mode to get more realistic (polluted) callsite profiling stats.
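
Roughly, that would be configured through the JMH runner options like this (a sketch; the benchmark include pattern is hypothetical):

    import org.openjdk.jmh.runner.Runner;
    import org.openjdk.jmh.runner.options.Options;
    import org.openjdk.jmh.runner.options.OptionsBuilder;
    import org.openjdk.jmh.runner.options.WarmupMode;

    public class BulkWarmupRunner {
        public static void main(String[] args) throws Exception {
            // BULK warmup runs every benchmark's warmup before any measurement,
            // so hot call sites see all receiver types ("polluted" profiles) up front.
            Options opt = new OptionsBuilder()
                    .include("CharStreamBench.*")     // hypothetical benchmark class
                    .warmupMode(WarmupMode.BULK)
                    .build();
            new Runner(opt).run();
        }
    }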

@bhamiltoncx bhamiltoncx deleted the char-streams-memory-usage branch March 29, 2017 17:24
@sharwell
Member

sharwell commented Mar 30, 2017

@bhamiltoncx What was the result of using ANTLRUnicodeInputStream (sharwell/antlr4@51dbd231) instead of a whole new stream infrastructure? One optimization that Java seems to like that I didn't do in this file so far is splitting LA(int) into a short fast path that can be inlined, with the longer slow path after the initial condition moved to a separate method. This stream is optimized for code points no greater than U+FFFF, but the lexer itself is optimized for code points no greater than U+007F so that doesn't seem like a particular problem or mismatch.
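
The split being described looks roughly like this (a generic sketch, not code from either stream implementation): keep LA(int) tiny so it inlines, and push everything rare into a separate method.

    final class FastSlowLaSketch {
        private final char[] data;
        private int p;   // index of the current symbol

        FastSlowLaSketch(char[] data) { this.data = data; }

        // Tiny body: cheap for the JIT to inline at every call site.
        int LA(int i) {
            if (i == 1 && p < data.length) {
                return data[p];
            }
            return laSlowPath(i);   // rare cases live out of line
        }

        private int laSlowPath(int i) {
            if (i == 0) throw new IllegalArgumentException("LA(0) is undefined");
            int index = (i > 0) ? p + i - 1 : p + i;
            if (index < 0 || index >= data.length) return -1;   // IntStream.EOF
            return data[index];
        }
    }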

@bhamiltoncx
Contributor Author

bhamiltoncx commented Mar 30, 2017 via email

@bhamiltoncx
Contributor Author

Here is an unbelievably amazing write-up of exactly the processes behind the JIT optimizer that explains what we ran into here. Great reading!

https://shipilev.net/blog/2015/black-magic-method-dispatch/
