
Switch parser library to megaparsec #16

Closed
bennofs opened this issue Aug 25, 2014 · 35 comments
Labels
enhancement (Betterness, without new explicit feature possibilities), help wanted (When someone explicitly requests for help)

Comments

@bennofs
Collaborator

bennofs commented Aug 25, 2014

The parser is currently quite slow compared to the C parser:

$ time nix-instantiate --parse ~/nixpkgs/pkgs/top-level/all-packages.nix > /dev/null
0.10user 0.02system 0:00.36elapsed 33%CPU (0avgtext+0avgdata 8908maxresident)k
0inputs+0outputs (0major+2511minor)pagefaults 0swaps

compare that to hnix:

$ time dist/build/hnix/hnix ~/nixpkgs/pkgs/top-level/all-packages.nix  > /dev/null 
hnix: Operators are not yet defined

real    0m6.914s
user    0m4.056s
sys 0m0.172s

I don't know how to improve performance though.

@Mathnerd314

Happy is much closer to the Bison parser that Nix uses, and probably of comparable speed. Does anyone mind if I switch hnix away from trifecta/parsec?

@jwiegley
Member

I do. We just need to optimize the present parser. I think many more people are comfortable reading and editing Parsec than Happy grammars, so switching would shrink our pool of candidate contributors in exchange for a speed gain. I'm pretty sure we can get there with Trifecta; we probably just have some nasty backtracking happening in a few key places.

@Mathnerd314

Hmm. Well, I for one am not sure how to get Parsec/Trifecta to parse the let-body syntax:
let { a = 1; body = 2; }
This is a perfectly fine Nix expression, but HNix expects it to be of the form let { ... } in ....

Happy has a GLR mode which seems to be working, although it is proving difficult to integrate it with the rest of hnix.
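For illustration, the braced and `in` forms can be handled in an ordinary predictive parser by committing to the shared `let` keyword once and then dispatching on the next token, with no GLR and no backtracking over the prefix. A base-only toy sketch (a hypothetical mini-grammar, not hnix's actual parser; keyword-boundary checking is omitted):

```haskell
import Data.Char (isAlpha, isDigit, isSpace)
import Data.List (stripPrefix)

-- Toy parser type: Nothing is failure, Just (value, rest) is success.
type P a = String -> Maybe (a, String)

tok :: String -> P String
tok t s = (,) t <$> stripPrefix t (dropWhile isSpace s)

data Let
  = LetBraces [(String, String)]          -- let { a = 1; body = 2; }
  | LetIn     [(String, String)] String   -- let a = 1; in a
  deriving (Eq, Show)

-- Commit to the shared "let" prefix once, then dispatch on the next
-- token; neither form requires rewinding over "let".
letExpr :: P Let
letExpr s0 = do
  (_, s1) <- tok "let" s0
  case tok "{" s1 of
    Just (_, s2) -> do
      (bs, s3) <- bindings s2
      (_,  s4) <- tok "}" s3
      pure (LetBraces bs, s4)
    Nothing -> do
      (bs, s2) <- bindings s1
      (_,  s3) <- tok "in" s2
      (e,  s4) <- ident s3
      pure (LetIn bs e, s4)

-- Stand-in for the real binding parser: name = digits ;
binding :: P (String, String)
binding s0 = do
  (n, s1) <- ident s0
  (_, s2) <- tok "=" s1
  (v, s3) <- number s2
  (_, s4) <- tok ";" s3
  pure ((n, v), s4)

bindings :: P [(String, String)]
bindings s = case binding s of
  Nothing      -> Just ([], s)
  Just (b, s') -> do (bs, s'') <- bindings s'; pure (b : bs, s'')

ident, number :: P String
ident  s = nonEmpty (span isAlpha (dropWhile isSpace s))
number s = nonEmpty (span isDigit (dropWhile isSpace s))

nonEmpty :: (String, String) -> Maybe (String, String)
nonEmpty (w, r) = if null w then Nothing else Just (w, r)
```

The same left-factoring applies directly in a Parsec/Trifecta grammar: parse the `let` keyword first, then choose between the braced binding list and the `… in …` continuation.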

@cstrahan

cstrahan commented Mar 9, 2016

Hmm. Well, I for one am not sure how to get Parsec/Trifecta to parse the let-body syntax:
let { a = 1; body = 2; }

Wow, I didn't know that was legal syntax. Crazy.

@steshaw

steshaw commented Dec 21, 2016

The performance with respect to the C parser seems to have improved:

$ time nix-instantiate --parse ~/.nix-defexpr/channels/nixpkgs/pkgs/top-level/all-packages.nix  >/dev/null

real	0m0.139s
user	0m0.120s
sys	0m0.010s
$ time dist/build/hnix/hnix ~/.nix-defexpr/channels/nixpkgs/pkgs/top-level/all-packages.nix >/dev/null

real	0m1.306s
user	0m1.236s
sys	0m0.063s

Are further improvements expected?

@jwiegley
Member

I certainly hope so; an order of magnitude seems a bit excessive to me. I'm not sure a proper review has been done of the back-tracking in hnix. This should be much easier now with my parsec-free library, since the evaluator can be tailored to report all backtracking.

@steshaw

steshaw commented Dec 21, 2016

@jwiegley sounds interesting. How would I tailor parsec-free to report the backtracking?

@jwiegley
Member

jwiegley commented Dec 21, 2016

You'll need to switch over to using parsec, which I believe is still just a matter of a flag (unless that has changed since I last looked at the parser). Then change the dependency from parsec to parsec-free. Check out the dump and look for invocations of try, or edit the evaluator in Parsec.Text.Free.Eval and modify it to print whenever a backtracking parse is attempted (you can even report how many bytes it rewound, if you annotate the details reported by attempt).

@jwiegley
Member

Oh, and to get a dump, you'll need to use parseTestLog. See the README for more details.

@steshaw

steshaw commented Dec 21, 2016

Thanks, @jwiegley. Things don't quite compile with -fparsec at the moment as there's an unexpected dependency on Trifecta.Delta for position. I've cobbled something together though that compiles (and doesn't error at runtime).

I'll try switching to parsec-free and parseTestLog and see how things go!

@steshaw

steshaw commented Dec 21, 2016

I managed to get something working with parsec-free. I've run the result on simple-pretty.nix; the output can be found in this gist. I don't rightly understand the output, but I hope it's helpful (warning: it's very long). How do you identify backtracking with parsec-free? Is it the bracketed nodes?

The code is a hack. I had to add orphan instances for CharParsing, TokenParsing and Parsing. The package requires trifecta in addition to parsec-free (due to some seeping of DeltaParsing into the code). I also had to comment out some code where I haven't worked through the type errors. Though you'd never want to merge the code, I thought it might be helpful to publish it - see my parsec-free-hack branch.

@jwiegley
Member

jwiegley commented Dec 21, 2016

The nodes in parentheses represent failed parsing attempts where the failure did not consume input. Only try rewinds the current position on failure, so any non-try failure means the parser aborted simply by testing the current token. It does mean, however, that a lot of unnecessary tests happened, depending on how deep the failure tree is.

For this very simple example there are 1,482 failed try attempts, which is excessive for a 29 line long file!

We should see if reorganizing the parser can allow us to reduce the number of failed attempts, of which there are 4,309 total for your example. I can only imagine what all-packages.nix must incur.
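To make that count concrete, here is a toy instrumented parser (base-only; not hnix's parser and not parsec-free's API) that tallies every failed, rewound `try` attempt the way such a dump counts them:

```haskell
import Data.IORef

-- Toy model of what a backtracking report counts: every failed `try`
-- rewinds the input and throws away the work done so far.
type P a = IORef Int -> String -> IO (Maybe (a, String))

lit :: String -> P String
lit t _ s
  | take (length t) s == t = pure (Just (t, drop (length t) s))
  | otherwise              = pure Nothing

-- try: on failure the position is rewound; tally the wasted attempt.
try' :: P a -> P a
try' p ref s = do
  r <- p ref s
  case r of
    Nothing -> modifyIORef' ref (+ 1) >> pure Nothing
    _       -> pure r

-- p <|> q
orElse :: P a -> P a -> P a
orElse p q ref s = p ref s >>= maybe (q ref s) (pure . Just)

-- Parse 100 occurrences of "let …" with the longer keyword tried
-- first: every single one pays for a failed, rewound attempt.
failedAttempts :: IO Int
failedAttempts = do
  ref <- newIORef 0
  let keyword = try' (lit "letrec") `orElse` lit "let"
  mapM_ (\s -> () <$ keyword ref s) (replicate 100 "let x = 1;")
  readIORef ref  -- 100 wasted rewinds
```

Left-factoring the shared prefix drops that count to zero, which is the kind of reorganization being proposed here.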

@domenkozar
Collaborator

Current status:

$ time nix-instantiate --parse pkgs/top-level/all-packages.nix > /dev/null

real    0m0.065s
user    0m0.059s
sys     0m0.006s

$ time $(nix-build -A haskellPackages.hnix)/bin/hnix pkgs/top-level/all-packages.nix  >/dev/null

real    0m2.095s
user    0m1.960s
sys     0m0.091s

@jwiegley
Member

jwiegley commented Jun 2, 2017

Yeah, that kind of slow-down is really not acceptable.

@jwiegley
Member

jwiegley commented Apr 6, 2018

At the moment Trifecta is very slow and memory-hungry. It dominates the profiling output (though maybe @ekmett knows what we're doing wrong). Yet it gives such great error messages that it would be hard to give up.

I wonder if we could use both attoparsec and Trifecta, and if parsing fails with the former, reparse with the latter for the error message.

I mention this because right now parsing the entirety of nixpkgs takes several minutes, which is unacceptable.
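The two-pass idea can be sketched generically; the parser types below are placeholder functions, not the real attoparsec or trifecta APIs:

```haskell
-- Run the fast parser with poor errors; only when it fails, reparse
-- with the slow parser whose sole job is the good error message.
parseWithFallback
  :: (s -> Either e0 a)  -- fast parser (attoparsec's role)
  -> (s -> Either e1 a)  -- slow, rich-errors parser (trifecta's role)
  -> s
  -> Either e1 a
parseWithFallback fast slow input =
  case fast input of
    Right a -> Right a
    Left _  -> slow input  -- pay for the reparse only on the error path
```

The happy path never pays the slow parser's cost. The caveat is that the two grammars must agree exactly, or the fallback can succeed where the fast parser failed, or report an error for the wrong reason.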

@jwiegley
Member

jwiegley commented Apr 6, 2018

Here's the profiling results for all of nixpkgs. It appears that much of the slowness is due to the enormous turn-over in temporary allocations:

        Thu Apr  5 20:50 2018 Time and Allocation Profiling Report  (Final)

           hnix +RTS -p -RTS -f -

        total time  =      243.46 secs   (243463 ticks @ 1000 us, 1 processor)
        total alloc = 310,382,398,904 bytes  (excludes profiling overheads)

COST CENTRE       MODULE                                SRC                                             >

>>=.\             Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:(144,5)-(171,13)    >
satisfy.\         Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:(227,5)-(235,58)    >
fmap.\            Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:108:46-67           >
<?>.\             Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:(204,41)-(206,50)   >
<$.\              Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:110:42-71           >
<|>.\             Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:123:5-88            >
<|>.\.\           Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:123:17-76           >
>>=.\.\           Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:(151,24)-(169,15)   >
unconsB           Codec.Binary.UTF8.Generic             Codec/Binary/UTF8/Generic.hs:297:1-18           >
>>=.\.\           Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:147:16-88           >
parseFromFileEx   Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:(365,1)-(367,70)    >
try.\             Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:202:45-72           >
decode            Data.ByteString.UTF8                  Data/ByteString/UTF8.hs:(68,1)-(124,32)         >
string            Text.Parser.Char                      src/Text/Parser/Char.hs:216:3-51                >
identifier        Nix.Parser.Library                    Nix/Parser/Library.hs:99:1-46                   >
decode.choose     Data.ByteString.UTF8                  Data/ByteString/UTF8.hs:(72,3)-(78,39)          >
displayIO.display Text.PrettyPrint.ANSI.Leijen.Internal Text/PrettyPrint/ANSI/Leijen/Internal.hs:(1159,7>
<?>.\.\           Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:206:13-49           >
<?>               Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:(204,3)-(206,50)    >
char              Text.Parser.Char                      src/Text/Parser/Char.hs:196:3-38                >
>>=               Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:(143,3)-(171,13)    >
fmap              Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:108:3-67            >
manyAccum.\.walk  Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:190:7-119           >
>>=.\.\.\         Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:158:20-51           >
uncons            Data.ByteString.UTF8                  Data/ByteString/UTF8.hs:(166,1)-(167,38)        >
<>                Text.Trifecta.Delta                   src/Text/Trifecta/Delta.hs:(119,3)-(134,101)    >
manyAccum.\       Text.Trifecta.Parser                  src/Text/Trifecta/Parser.hs:(190,3)-(193,47)    >

@jwiegley
Member

jwiegley commented Apr 6, 2018

Here's what these numbers and errors need to be compared with, by running:

time find ~/src/nix/nixpkgs -name '*.nix' | xargs nix-instantiate --parse > /dev/null

which gives:

error: undefined variable ‘hostPlatform’ at /Users/johnw/src/nix/nixpkgs/pkgs/os-specific/windows/pthread-w32/default.nix:4:8
error: undefined variable ‘URL’ at /Users/johnw/src/nix/nixpkgs/pkgs/development/interpreters/python/cpython/docs/template.nix:8:11
error: undefined variable ‘placeholder’ at /Users/johnw/src/nix/nixpkgs/pkgs/development/libraries/pipewire/default.nix:37:46

real	0m3.158s
user	0m2.499s
sys	0m1.516s

@jwiegley
Member

jwiegley commented Apr 6, 2018

With profiling disabled and -O2, hnix completes in 54s, so presently 17.4x slower.

@jwiegley
Member

jwiegley commented Apr 6, 2018

@ekmett Suggests the attoparsec-then-trifecta approach, since we've already abstracted the parser using the parsers library.

@jwiegley jwiegley changed the title Improve performance of parser Switch parser library to megaparsec Apr 9, 2018
@jwiegley
Member

jwiegley commented Apr 9, 2018

I'm actually just going to switch to megaparsec, since not only do we need speed, but we also need to be able to query the current stream position during parsing, which attoparsec does not provide.
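To illustrate the requirement: querying the position mid-parse forces the parser to thread a line/column through every consumed character, state that attoparsec deliberately omits and that megaparsec keeps (exposed as `getSourcePos`). A base-only toy model, not either library's implementation:

```haskell
-- Position threaded through the parse state.
data Pos = Pos { posLine :: !Int, posCol :: !Int } deriving (Eq, Show)

type P a = (Pos, String) -> Maybe (a, (Pos, String))

advance :: Pos -> Char -> Pos
advance (Pos l _) '\n' = Pos (l + 1) 1
advance (Pos l c) _    = Pos l (c + 1)

anyChar :: P Char
anyChar (_, [])       = Nothing
anyChar (p, c : rest) = Just (c, (advance p c, rest))

-- The operation attoparsec cannot offer: read the current position
-- without consuming input (megaparsec's getSourcePos plays this role).
getPos :: P Pos
getPos st@(p, _) = Just (p, st)

-- Position just after consuming a string, starting at line 1, column 1.
posAfter :: String -> Pos
posAfter = foldl advance (Pos 1 1)
```

hnix needs exactly this for source-annotated ASTs, which is why speed alone does not settle the choice.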

@mightybyte
Member

While you're at it you may want to keep in mind the error messages. I just discovered that hnix --eval data/nix/tests/lang/eval-okay-regex-split.nix gives an atrocious error message when one of the assertions in there fails (with the good implementation of builtins.split found at #154). I'm not sure if this will be covered by this parser work or if it should be a separate issue.

@jwiegley
Member

jwiegley commented Apr 9, 2018

This work has started on the megaparsec branch, and should be working early this week.

@ryantrinkle
Member

@mightybyte I think the main issue with the error message is that the inner string needs to get putStrLned rather than printed.

@Ericson2314
Member

I feel like @taktoa did some megaparsec thing that might be useful, but maybe that was just for Ninja.

@taktoa
Collaborator

taktoa commented Apr 9, 2018

I've never parsed Nix with megaparsec, but NixOS/nix#1102 and https://github.com/groxxda/nixexpr may be useful to you.

@jwiegley
Member

jwiegley commented Apr 9, 2018

@mightybyte Check out how the NixLanguageTests print their error messages. I’ll move that to Main.hs if you don’t get to it before me.

@cbarrett

@jwiegley Let me know if there's stuff that can be done, I've worked with a number of different Parsec-inspired libraries in Haskell, so I feel like I'd be comfortable helping out some with the migration.

@jwiegley
Member

We're now switched to megaparsec, and comparing apples to apples, we're twice as slow as nix:

time find ~/src/nix/nixpkgs -name '*.nix' | while read file ; do nix-instantiate --parse $file > /dev/null ; done
  => 0.85s user 1.70s system 1% cpu 2:29.75 total

time find ~/src/nix/nixpkgs -name '*.nix' | while read file ; do ./dist/build/hnix/hnix $file > /dev/null ; done
  => real 4m50.095s, user 1m33.585s, sys 1m15.932s

I find it interesting that Nix spends so little time actually computing; most of the 2.5 minutes of wall time is probably spent just dumping the pretty-printed results into /dev/null.

The megaparsec parser is currently slow due to <|>. I've reduced the amount of back-tracking and alternatives quite a bit, but there's still more to be done there, if others with expertise in megaparsec would like to take a look.
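One standard way to tame an `<|>` chain is a character of lookahead: dispatch directly to the one alternative that can succeed instead of letting each run and fail in turn. A base-only toy (hypothetical mini-grammar, not hnix's; in the real parser the alternatives may consume and backtrack before failing, which is where the time goes):

```haskell
import Data.Char (isDigit)

type P a = String -> Maybe (a, String)

-- Three toy alternatives: "…"-strings, […]-lists, digit runs.
str, list, num :: P String
str  ('"' : rest) = case break (== '"') rest of
                      (s, '"' : r) -> Just (s, r)
                      _            -> Nothing
str  _            = Nothing
list ('[' : rest) = case break (== ']') rest of
                      (s, ']' : r) -> Just (s, r)
                      _            -> Nothing
list _            = Nothing
num  s            = case span isDigit s of
                      ([], _) -> Nothing
                      (d,  r) -> Just (d, r)

-- <|>-style: each alternative runs (and possibly fails) in turn.
valAlt :: P String
valAlt s = str s `orElse` list s `orElse` num s
  where orElse a b = maybe b Just a

-- One character of lookahead: exactly one alternative ever runs.
valDispatch :: P String
valDispatch s@(c : _)
  | c == '"'  = str s
  | c == '['  = list s
  | isDigit c = num s
valDispatch _ = Nothing
```

Both parse the same language; the dispatching version just never pays for a failed branch.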

In terms of pure parsing only, I see the following with -O2 -fexpose-all-unfoldings:

real	0m34.172s
user	0m34.233s
sys	0m2.141s

And here are the profiling results:

hnix.prof.zip

@jwiegley
Member

@cbarrett Would you like to do some performance work in Parser.hs? I shouldn't be modifying it too much for the rest of the week.

@cbarrett

@jwiegley I'll take a look this week and let you know whether or not I find anything

@nh2

nh2 commented Jun 14, 2018

@jwiegley One possible reason for slowness: Data.Text.readFile. Using ByteString.readFile and decodeUtf8' might be significantly faster, as it doesn't go through the encoding-based Handle code path, which is known to be slow.

txt <- liftIO (T.readFile path)

https://www.snoyman.com/blog/2016/12/beware-of-readfile

@jwiegley
Member

@nh2 Nice idea, but it didn't make a change in the overall speed. The profiling results show that it's due to allocation in megaparsec, rather than decoding in readFile.

@jwiegley
Member

@nh2 And strangely enough, using Text.readFile it takes 45s to parse all of nixpkgs, but following Michael's recommended approach takes 85s.

@nh2

nh2 commented Jun 15, 2018

@jwiegley I can't reproduce it with the following code:

module Main (main) where

import           Control.Monad (void)
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as T
import qualified Data.Text.IO as T
import           System.Environment (getArgs)

main :: IO ()
main = do
  args <- getArgs
  case args of
    ["bs", path] -> print =<< (fmap T.length <$> T.decodeUtf8' <$> BS.readFile path)
    ["text", path] -> print =<< (T.length <$> T.readFile path)
    _ -> error "takes 2 arguments (first is 'bs' or 'text', second is the file)"

and a 600 MB example text file:

% time ./read-performance bs verymuchjson
Right 650563173
./read-performance bs verymuchjson  0.67s user 0.54s system 99% cpu 1.209 total

% time ./read-performance text verymuchjson
650563173
./read-performance text verymuchjson  2.51s user 1.02s system 99% cpu 3.531 total

So for me T.decodeUtf8' <$> BS.readFile path is 3x faster than T.readFile.

@nh2

nh2 commented Jun 15, 2018

@jwiegley A few more points

  • As per the blog post, T.readFile is also problematic from a correctness point of view
  • What environment are you running this in? T.readFile uses Handle's encoding guessing logic, results may vary based on your LC_* or other environment variables.
  • Can you also compare with rio's readFileUtf8?
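For comparison, rio's readFileUtf8 is, to a first approximation (an assumption on my part, not rio's verbatim source), a strict ByteString read followed by a pure, lenient UTF-8 decode, so it bypasses Handle's locale-dependent decoding entirely and is insensitive to LC_* settings:

```haskell
import qualified Data.ByteString as BS
import           Data.Text (Text)
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Encoding.Error as TE

-- Strict read + pure decode; invalid bytes become U+FFFD rather than
-- throwing, unlike decodeUtf8'.
readFileUtf8 :: FilePath -> IO Text
readFileUtf8 = fmap (TE.decodeUtf8With TE.lenientDecode) . BS.readFile
```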

Anton-Latukha added a commit that referenced this issue Aug 14, 2020
To reduce boilerplate and make the configuration terser, I used the Travis way of providing the secret key (via an environment variable), but discovered that this style of provisioning does not work with Cachix on GitHub: it currently works only by explicitly setting the Cachix GitHub Action's internal option for every use of the action (i.e. every build).

M  .github/workflows/Nixpkgs-Linux-additional.yml
M  .github/workflows/Nixpkgs-Linux-main.yml
M  .github/workflows/Nixpkgs-macOS.yml