
Switch parser library to megaparsec #16

Closed
bennofs opened this issue Aug 25, 2014 · 35 comments
Labels
enhancement (Betterness, without new explicit feature possibilities), help wanted (When someone explicitly requests for help)

Comments

@bennofs
Collaborator

bennofs commented Aug 25, 2014

The parser is currently quite slow compared to the C parser:

$ time nix-instantiate --parse ~/nixpkgs/pkgs/top-level/all-packages.nix > /dev/null
0.10user 0.02system 0:00.36elapsed 33%CPU (0avgtext+0avgdata 8908maxresident)k
0inputs+0outputs (0major+2511minor)pagefaults 0swaps

compare that to hnix:

$ time dist/build/hnix/hnix ~/nixpkgs/pkgs/top-level/all-packages.nix  > /dev/null 
hnix: Operators are not yet defined

real    0m6.914s
user    0m4.056s
sys 0m0.172s

I don't know how to improve performance though.

@Mathnerd314

Happy is much closer to the Bison parser that Nix uses, and probably of comparable speed. Does anyone mind if I switch hnix away from trifecta/parsec?

@jwiegley
Member

I do. We just need to optimize the present parser. I think many more people are comfortable reading and editing Parsec than Happy grammars, so switching would shrink our pool of candidate contributors in exchange for a speed gain. I'm pretty sure we can get there with Trifecta; we probably just have some nasty backtracking happening in a few key places.

@Mathnerd314

Hmm. Well, I for one am not sure how to get Parsec/Trifecta to parse the let-body syntax:
let { a = 1; body = 2; }
This is a perfectly fine Nix expression, but HNix expects it to be of the form let { ... } in ....

Happy has a GLR mode which seems to be working, although it is proving difficult to integrate it with the rest of hnix.
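For illustration, the braced and `in` forms can be handled in an ordinary predictive parser by committing to the shared `let` keyword once and then dispatching on the next token, with no GLR and no backtracking over the prefix. A base-only toy sketch (a hypothetical mini-grammar, not hnix's actual parser; keyword-boundary checking is omitted):

```haskell
import Data.Char (isAlpha, isDigit, isSpace)
import Data.List (stripPrefix)

-- Toy parser type: Nothing is failure, Just (value, rest) is success.
type P a = String -> Maybe (a, String)

tok :: String -> P String
tok t s = (,) t <$> stripPrefix t (dropWhile isSpace s)

data Let
  = LetBraces [(String, String)]          -- let { a = 1; body = 2; }
  | LetIn     [(String, String)] String   -- let a = 1; in a
  deriving (Eq, Show)

-- Commit to the shared "let" prefix once, then dispatch on the next
-- token; neither form requires rewinding over "let".
letExpr :: P Let
letExpr s0 = do
  (_, s1) <- tok "let" s0
  case tok "{" s1 of
    Just (_, s2) -> do
      (bs, s3) <- bindings s2
      (_,  s4) <- tok "}" s3
      pure (LetBraces bs, s4)
    Nothing -> do
      (bs, s2) <- bindings s1
      (_,  s3) <- tok "in" s2
      (e,  s4) <- ident s3
      pure (LetIn bs e, s4)

-- Stand-in for the real binding parser: name = digits ;
binding :: P (String, String)
binding s0 = do
  (n, s1) <- ident s0
  (_, s2) <- tok "=" s1
  (v, s3) <- number s2
  (_, s4) <- tok ";" s3
  pure ((n, v), s4)

bindings :: P [(String, String)]
bindings s = case binding s of
  Nothing      -> Just ([], s)
  Just (b, s') -> do (bs, s'') <- bindings s'; pure (b : bs, s'')

ident, number :: P String
ident  s = nonEmpty (span isAlpha (dropWhile isSpace s))
number s = nonEmpty (span isDigit (dropWhile isSpace s))

nonEmpty :: (String, String) -> Maybe (String, String)
nonEmpty (w, r) = if null w then Nothing else Just (w, r)
```

The same left-factoring applies directly in a Parsec/Trifecta grammar: parse the `let` keyword first, then choose between the braced binding list and the `… in …` continuation.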

@cstrahan

cstrahan commented Mar 9, 2016

Hmm. Well, I for one am not sure how to get Parsec/Trifecta to parse the let-body syntax:
let { a = 1; body = 2; }

Wow, I didn't know that was legal syntax. Crazy.

@steshaw

steshaw commented Dec 21, 2016

The performance with respect to the C parser seems to have improved:

$ time nix-instantiate --parse ~/.nix-defexpr/channels/nixpkgs/pkgs/top-level/all-packages.nix  >/dev/null

real	0m0.139s
user	0m0.120s
sys	0m0.010s
$ time dist/build/hnix/hnix ~/.nix-defexpr/channels/nixpkgs/pkgs/top-level/all-packages.nix >/dev/null

real	0m1.306s
user	0m1.236s
sys	0m0.063s

Are further improvements expected?

@jwiegley
Member

I certainly hope so; an order of magnitude seems a bit excessive to me. I'm not sure a proper review has been done of the back-tracking in hnix. This should be much easier now with my parsec-free library, since the evaluator can be tailored to report all backtracking.

@steshaw

steshaw commented Dec 21, 2016

@jwiegley sounds interesting. How would I tailor parsec-free to report the backtracking?

@jwiegley
Member

jwiegley commented Dec 21, 2016

You'll need to switch over to using parsec, which I believe is still just a matter of a flag (unless that has changed since I last looked at the parser). Then change the dependency from parsec to parsec-free. Check out the dump and look for invocations of try, or edit the evaluator in Parsec.Text.Free.Eval and modify it to print whenever a backtracking parse is attempted (you can even report how many bytes it rewound, if you annotate the details reported by attempt).

@jwiegley
Member

Oh, and to get a dump, you'll need to use parseTestLog. See the README for more details.

@steshaw

steshaw commented Dec 21, 2016

Thanks, @jwiegley. Things don't quite compile with -fparsec at the moment as there's an unexpected dependency on Trifecta.Delta for position. I've cobbled something together though that compiles (and doesn't error at runtime).

I'll try switching to parsec-free and parseTestLog and see how things go!

@steshaw

steshaw commented Dec 21, 2016

I managed to get something working with parsec-free. I've run the result on simple-pretty.nix; the output can be found in this gist. I don't rightly understand the output, but I hope it's helpful (warning: it's very long). How do you identify backtracking with parsec-free? Is it the bracketed nodes?

The code is a hack. I had to add orphan instances for CharParsing, TokenParsing and Parsing. The package requires trifecta in addition to parsec-free (due to some seeping of DeltaParsing into the code). I also had to comment out some code where I haven't worked through the type errors. Though you'd never want to merge the code, I thought it might be helpful to publish it - see my parsec-free-hack branch.

@jwiegley
Member

jwiegley commented Dec 21, 2016

The nodes in parentheses represent failed parsing attempts where the failure did not consume input. Only try rewinds the current position on failure, so any non-try failure means the parser aborted simply by testing the current token. It does mean, however, that a lot of unnecessary tests happened, depending on how deep the failure tree is.

For this very simple example there are 1,482 failed try attempts, which is excessive for a 29 line long file!

We should see if reorganizing the parser can allow us to reduce the number of failed attempts, of which there are 4,309 total for your example. I can only imagine what all-packages.nix must incur.
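To make that count concrete, here is a toy instrumented parser (base-only; not hnix's parser and not parsec-free's API) that tallies every failed, rewound `try` attempt the way such a dump counts them:

```haskell
import Data.IORef

-- Toy model of what a backtracking report counts: every failed `try`
-- rewinds the input and throws away the work done so far.
type P a = IORef Int -> String -> IO (Maybe (a, String))

lit :: String -> P String
lit t _ s
  | take (length t) s == t = pure (Just (t, drop (length t) s))
  | otherwise              = pure Nothing

-- try: on failure the position is rewound; tally the wasted attempt.
try' :: P a -> P a
try' p ref s = do
  r <- p ref s
  case r of
    Nothing -> modifyIORef' ref (+ 1) >> pure Nothing
    _       -> pure r

-- p <|> q
orElse :: P a -> P a -> P a
orElse p q ref s = p ref s >>= maybe (q ref s) (pure . Just)

-- Parse 100 occurrences of "let …" with the longer keyword tried
-- first: every single one pays for a failed, rewound attempt.
failedAttempts :: IO Int
failedAttempts = do
  ref <- newIORef 0
  let keyword = try' (lit "letrec") `orElse` lit "let"
  mapM_ (\s -> () <$ keyword ref s) (replicate 100 "let x = 1;")
  readIORef ref  -- 100 wasted rewinds
```

Left-factoring the shared prefix drops that count to zero, which is the kind of reorganization being proposed here.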

@domenkozar
Collaborator

Current status:

$ time nix-instantiate --parse pkgs/top-level/all-packages.nix > /dev/null

real    0m0.065s
user    0m0.059s
sys     0m0.006s

$ time $(nix-build -A haskellPackages.hnix)/bin/hnix pkgs/top-level/all-packages.nix  >/dev/null

real    0m2.095s
user    0m1.960s
sys     0m0.091s

@jwiegley
Member

jwiegley commented Jun 2, 2017

Yeah, that kind of slow-down is really not acceptable.

@jwiegley
Member

jwiegley commented Apr 6, 2018

At the moment Trifecta is very slow and memory-hungry. It dominates the profiling output (though maybe @ekmett knows what we're doing wrong). Yet it gives such great error messages that it would be hard to give up.

I wonder if we could use both attoparsec and Trifecta, and if parsing fails with the former, reparse with the latter for the error message.

I mention this because right now parsing the entirety of nixpkgs takes several minutes, which is unacceptable.
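The two-pass idea can be sketched generically; the parser types below are placeholder functions, not the real attoparsec or trifecta APIs:

```haskell
-- Run the fast parser with poor errors; only when it fails, reparse
-- with the slow parser whose sole job is the good error message.
parseWithFallback
  :: (s -> Either e0 a)  -- fast parser (attoparsec's role)
  -> (s -> Either e1 a)  -- slow, rich-errors parser (trifecta's role)
  -> s
  -> Either e1 a
parseWithFallback fast slow input =
  case fast input of
    Right a -> Right a
    Left _  -> slow input  -- pay for the reparse only on the error path
```

The happy path never pays the slow parser's cost. The caveat is that the two grammars must agree exactly, or the fallback can succeed where the fast parser failed, or report an error for the wrong reason.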

@jwiegley
Member

jwiegley commented Apr 6, 2018

Here's the profiling results for all of nixpkgs. It appears that much of the slowness is due to the enormous turn-over in temporary allocations:

        Thu Apr  5 20:50 2018 Time and Allocation Profiling Report  (Final)

           hnix +RTS -p -RTS -f -

        total time  =      243.46 secs   (243463 ticks @ 1000 us, 1 processor)
        total alloc = 310,382,398,904 bytes  (excludes profiling overheads)

COST CENTRE       MODULE                                SRC                                             >

>>=.\             Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:(144,5)-(171,13)    >
satisfy.\         Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:(227,5)-(235,58)    >
fmap.\            Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:108:46-67           >
<?>.\             Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:(204,41)-(206,50)   >
<$.\              Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:110:42-71           >
<|>.\             Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:123:5-88            >
<|>.\.\           Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:123:17-76           >
>>=.\.\           Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:(151,24)-(169,15)   >
unconsB           Codec.Binary.UTF8.Generic             Codec/Binary/UTF8/Generic.hs:297:1-18           >
>>=.\.\           Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:147:16-88           >
parseFromFileEx   Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:(365,1)-(367,70)    >
try.\             Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:202:45-72           >
decode            Data.ByteString.UTF8                  Data/ByteString/UTF8.hs:(68,1)-(124,32)         >
string            Text.Parser.Char                      src/Text/Parser/Char.hs:216:3-51                >
identifier        Nix.Parser.Library                    Nix/Parser/Library.hs:99:1-46                   >
decode.choose     Data.ByteString.UTF8                  Data/ByteString/UTF8.hs:(72,3)-(78,39)          >
displayIO.display Text.PrettyPrint.ANSI.Leijen.Internal Text/PrettyPrint/ANSI/Leijen/Internal.hs:(1159,7>
<?>.\.\           Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:206:13-49           >
<?>               Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:(204,3)-(206,50)    >
char              Text.Parser.Char                      src/Text/Parser/Char.hs:196:3-38                >
>>=               Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:(143,3)-(171,13)    >
fmap              Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:108:3-67            >
manyAccum.\.walk  Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:190:7-119           >
>>=.\.\.\         Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:158:20-51           >
uncons            Data.ByteString.UTF8                  Data/ByteString/UTF8.hs:(166,1)-(167,38)        >
<>                Text.Trifecta.Delta                   src/Text/Trifecta/Delta.hs:(119,3)-(134,101)    >
manyAccum.\       Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:(190,3)-(193,47)    >

@jwiegley
Member

jwiegley commented Apr 6, 2018

Here's what these numbers and errors need to be compared with, by running:

time find ~/src/nix/nixpkgs -name '*.nix' | xargs nix-instantiate --parse > /dev/null

which gives:

error: undefined variable ‘hostPlatform’ at /Users/johnw/src/nix/nixpkgs/pkgs/os-specific/windows/pthread-w32/default.nix:4:8
error: undefined variable ‘URL’ at /Users/johnw/src/nix/nixpkgs/pkgs/development/interpreters/python/cpython/docs/template.nix:8:11
error: undefined variable ‘placeholder’ at /Users/johnw/src/nix/nixpkgs/pkgs/development/libraries/pipewire/default.nix:37:46

real	0m3.158s
user	0m2.499s
sys	0m1.516s

@jwiegley
Member

jwiegley commented Apr 6, 2018

With profiling disabled and -O2, hnix completes in 54s, so presently 17.4x slower.

@jwiegley
Member

jwiegley commented Apr 6, 2018

@ekmett Suggests the attoparsec-then-trifecta approach, since we've already abstracted the parser using the parsers library.

@jwiegley jwiegley changed the title Improve performance of parser Switch parser library to megaparsec Apr 9, 2018
@jwiegley
Member

jwiegley commented Apr 9, 2018

I'm actually just going to switch to megaparsec, since not only do we need speed, but we also need to be able to query the current stream position during parsing, which attoparsec does not provide.
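To illustrate the requirement: querying the position mid-parse forces the parser to thread a line/column through every consumed character, state that attoparsec deliberately omits and that megaparsec keeps (exposed as `getSourcePos`). A base-only toy model, not either library's implementation:

```haskell
-- Position threaded through the parse state.
data Pos = Pos { posLine :: !Int, posCol :: !Int } deriving (Eq, Show)

type P a = (Pos, String) -> Maybe (a, (Pos, String))

advance :: Pos -> Char -> Pos
advance (Pos l _) '\n' = Pos (l + 1) 1
advance (Pos l c) _    = Pos l (c + 1)

anyChar :: P Char
anyChar (_, [])       = Nothing
anyChar (p, c : rest) = Just (c, (advance p c, rest))

-- The operation attoparsec cannot offer: read the current position
-- without consuming input (megaparsec's getSourcePos plays this role).
getPos :: P Pos
getPos st@(p, _) = Just (p, st)

-- Position just after consuming a string, starting at line 1, column 1.
posAfter :: String -> Pos
posAfter = foldl advance (Pos 1 1)
```

hnix needs exactly this for source-annotated ASTs, which is why speed alone does not settle the choice.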

@mightybyte
Member

While you're at it you may want to keep in mind the error messages. I just discovered that hnix --eval data/nix/tests/lang/eval-okay-regex-split.nix gives an atrocious error message when one of the assertions in there fails (with the good implementation of builtins.split found at #154). I'm not sure if this will be covered by this parser work or if it should be a separate issue.

@jwiegley
Member

jwiegley commented Apr 9, 2018

This work has started on the megaparsec branch, and should be working early this week.

@ryantrinkle
Member

@mightybyte I think the main issue with the error message is that the inner string needs to get putStrLned rather than printed.

@Ericson2314
Member

I feel like @taktoa did some megaparsec thing that might be useful, but maybe that was just for Ninja.

@taktoa
Collaborator

taktoa commented Apr 9, 2018

I've never parsed Nix with megaparsec, but NixOS/nix#1102 and https://github.com/groxxda/nixexpr may be useful to you.

@jwiegley
Member

jwiegley commented Apr 9, 2018

@mightybyte Check out how the NixLanguageTests print their error messages. I’ll move that to Main.hs if you don’t get to it before me.

@cbarrett

@jwiegley Let me know if there's stuff that can be done, I've worked with a number of different Parsec-inspired libraries in Haskell, so I feel like I'd be comfortable helping out some with the migration.

@jwiegley
Member

We're now switched to megaparsec, and comparing apples to apples, we're twice as slow as nix:

time find ~/src/nix/nixpkgs -name '*.nix' | while read file ; do nix-instantiate --parse $file > /dev/null ; done
  => 0.85s user 1.70s system 1% cpu 2:29.75 total

time find ~/src/nix/nixpkgs -name '*.nix' | while read file ; do ./dist/build/hnix/hnix $file > /dev/null ; done
  => real 4m50.095s, user 1m33.585s, sys 1m15.932s

I find it interesting that Nix spends so little time actually computing; most of the 2.5 minutes of wall time is probably spent just dumping the pretty-printed results into /dev/null.

The megaparsec parser is currently slow due to <|>. I've reduced the amount of back-tracking and alternatives quite a bit, but there's still more to be done there, if others with expertise in megaparsec would like to take a look.
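One standard way to tame an `<|>` chain is a character of lookahead: dispatch directly to the one alternative that can succeed instead of letting each run and fail in turn. A base-only toy (hypothetical mini-grammar, not hnix's; in the real parser the alternatives may consume and backtrack before failing, which is where the time goes):

```haskell
import Data.Char (isDigit)

type P a = String -> Maybe (a, String)

-- Three toy alternatives: "…"-strings, […]-lists, digit runs.
str, list, num :: P String
str  ('"' : rest) = case break (== '"') rest of
                      (s, '"' : r) -> Just (s, r)
                      _            -> Nothing
str  _            = Nothing
list ('[' : rest) = case break (== ']') rest of
                      (s, ']' : r) -> Just (s, r)
                      _            -> Nothing
list _            = Nothing
num  s            = case span isDigit s of
                      ([], _) -> Nothing
                      (d,  r) -> Just (d, r)

-- <|>-style: each alternative runs (and possibly fails) in turn.
valAlt :: P String
valAlt s = str s `orElse` list s `orElse` num s
  where orElse a b = maybe b Just a

-- One character of lookahead: exactly one alternative ever runs.
valDispatch :: P String
valDispatch s@(c : _)
  | c == '"'  = str s
  | c == '['  = list s
  | isDigit c = num s
valDispatch _ = Nothing
```

Both parse the same language; the dispatching version just never pays for a failed branch.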

In terms of pure parsing only, I see the following with -O2 -fexpose-all-unfoldings:

real	0m34.172s
user	0m34.233s
sys	0m2.141s

And here are the profiling results:

hnix.prof.zip

@jwiegley
Member

@cbarrett Would you like to do some performance work in Parser.hs? I shouldn't be modifying it too much for the rest of the week.

@cbarrett

@jwiegley I'll take a look this week and let you know whether or not I find anything

@nh2

nh2 commented Jun 14, 2018

@jwiegley One possible reason for slowness: Data.Text.readFile. Using ByteString.readFile and decodeUtf8' might be significantly faster, as it doesn't go through the encoding-based Handle code path, which is known to be slow.

txt <- liftIO (T.readFile path)

https://www.snoyman.com/blog/2016/12/beware-of-readfile

@jwiegley
Member

@nh2 Nice idea, but it didn't make a change in the overall speed. The profiling results show that it's due to allocation in megaparsec, rather than decoding in readFile.

@jwiegley
Member

@nh2 And strangely enough, using Text.readFile it takes 45s to parse all of nixpkgs, but following Michael's recommended approach takes 85s.

@nh2

nh2 commented Jun 15, 2018

@jwiegley I can't reproduce it with the following code:

module Main (main) where

import           Control.Monad (void)
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as T
import qualified Data.Text.IO as T
import           System.Environment (getArgs)

main :: IO ()
main = do
  args <- getArgs
  case args of
    ["bs", path] -> print =<< (fmap T.length <$> T.decodeUtf8' <$> BS.readFile path)
    ["text", path] -> print =<< (T.length <$> T.readFile path)
    _ -> error "takes 2 arguments (first is 'bs' or 'text', second is the file)"

and a 600 MB example text file:

% time ./read-performance bs verymuchjson
Right 650563173
./read-performance bs verymuchjson  0.67s user 0.54s system 99% cpu 1.209 total

% time ./read-performance text verymuchjson
650563173
./read-performance text verymuchjson  2.51s user 1.02s system 99% cpu 3.531 total

So for me T.decodeUtf8' <$> BS.readFile path is 3x faster than T.readFile.

@nh2

nh2 commented Jun 15, 2018

@jwiegley A few more points

  • As per the blog post, T.readFile is also problematic from a correctness point of view
  • What environment are you running this in? T.readFile uses Handle's encoding guessing logic, results may vary based on your LC_* or other environment variables.
  • Can you also compare with rio's readFileUtf8?
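For comparison, rio's readFileUtf8 is, to a first approximation (an assumption on my part, not rio's verbatim source), a strict ByteString read followed by a pure, lenient UTF-8 decode, so it bypasses Handle's locale-dependent decoding entirely and is insensitive to LC_* settings:

```haskell
import qualified Data.ByteString as BS
import           Data.Text (Text)
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Encoding.Error as TE

-- Strict read + pure decode; invalid bytes become U+FFFD rather than
-- throwing, unlike decodeUtf8'.
readFileUtf8 :: FilePath -> IO Text
readFileUtf8 = fmap (TE.decodeUtf8With TE.lenientDecode) . BS.readFile
```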

Anton-Latukha added a commit that referenced this issue Aug 14, 2020
To reduce boilerplate and make the configuration terser, I used the Travis way of providing the secret key (via an environment variable), but discovered that this style of provisioning does not work with Cachix on GitHub: it currently works only by explicitly setting the Cachix GitHub Action's internal option for every use of the action (i.e. every build).

M  .github/workflows/Nixpkgs-Linux-additional.yml
M  .github/workflows/Nixpkgs-Linux-main.yml
M  .github/workflows/Nixpkgs-macOS.yml