Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow compilation #428

Open
jwaldmann opened this issue Aug 8, 2018 · 6 comments
Open

Slow compilation #428

jwaldmann opened this issue Aug 8, 2018 · 6 comments

Comments

@jwaldmann
Copy link

Just for reference, I noticed https://ghc.haskell.org/trac/ghc/ticket/15488 and I can confirm that compilation is slow.

This is for accelerate-1.2.0.0 (current from hackage) with ghc-8.4.3 on GNU/Linux-amd64

@tmcdonell
Copy link
Member

Hey, thanks for opening that ticket! Yeah, accelerate takes ages to compile. I have been meaning to look into it for a while but never found the time; hopefully the GHC devs will have some insight into what is causing compile times to blow up.

@tdammers
Copy link

I just spent some time looking into this from the GHC side of things; it seems that, at least in the Data.Array.Accelerate.Arrray.Analysis.Hash module, the problem boils down to excessive inlining. In a nutshell, you are asking GHC to inline functions a few levels deep into the call graph, and it obeys, and as a result, the large pattern matching constructs inside those functions multiply instead of adding up. So one thing we are currently wondering is what the rationale is behind inlining the encode... functions there.

We're also wondering whether, by any chance, you know whether this is a new problem, or one that's been around on older GHC releases.

@tmcdonell
Copy link
Member

Hey @tdammers, thanks for poking on this, I've been eagerly following the ghc ticket...

Honestly that module probably doesn't need all the inlining. I think I just read in the bytestring builder docs somewhere that for good performance you need to inline everything.

The other problematic example listed in the ticket,Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith, I think has always been slow to compile. That, and the other test files, used to live in a different package however, so you weren't hit with this every time.

@tdammers
Copy link

Honestly that module probably doesn't need all the inlining. I think I just read in the bytestring builder docs somewhere that for good performance you need to inline everything.

Hmm, I think that documentation is kind of misleading then. You certainly don't need to inline everything, just the things that compose bytestring builders, and I believe the reasoning behind that is that RULES don't work across function application boundaries, which in turn means that important bytestring builder optimization opportunities may be missed.

It's actually possible that inlining does produce runtime perfomance benefits here btw.; it's just that the price we pay for this is a massive blow-up of compiler performance, and the question is whether we are willing to pay that price. A first step towards making this decision would be to do some profiling and benchmarking to get a better idea of the tradeoff. My gut feeling, however, says that the sheer size of that code is going to hurt CPU cache performance enough to melt away all the other performance benefits.

The other problematic example listed in the ticket,Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith, I think has always been slow to compile. That, and the other test files, used to live in a different package however, so you weren't hit with this every time.

Right, I haven't actually looked at that other module - since it's only a test module, I figured it'd be less of a hindrance to users of the library, after all, most people don't run the test suite for every dependency they install.

tmcdonell added a commit to tmcdonell/accelerate that referenced this issue Sep 27, 2018
@tmcdonell
Copy link
Member

In the standard build the test files do get built, unfortunately. The test programs are exported so they can be used for the different backends (accelerate-llvm-native and accelerate-llvm-ptx) but are also used by the interpreter backend in the accelerate package itself.

Historically, when the tests were in a separate package, those tests just never got run, so I think it is important that you can stack test/cabal test immediately.

Here is the ghc-8.6.1 -v3 log for ..NoFib.Prelude.ZipWith:

click to expand:

compile: input file src/Data/Array/Accelerate/Test/NoFib/Prelude/ZipWith.hs
*** Checking old interface for Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith (use -ddump-hi-diffs for more details):
[ 78 of 114] Compiling Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith ( src/Data/Array/Accelerate/Test/NoFib/Prelude/ZipWith.hs, .stack-work/dist/x86_64-osx/Cabal-2.4.0.1/build/Data/Array/Accelerate/Test/NoFib/Prelude/ZipWith.o )
*** Parser [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]:
!!! Parser [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]: finished in 9.17 milliseconds, allocated 19.024 megabytes
*** Renamer/typechecker [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]:
!!! Renamer/typechecker [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]: finished in 283.04 milliseconds, allocated 310.283 megabytes
*** Desugar [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]:
Result size of Desugar (before optimization)
  = {terms: 12,270, types: 12,347, coercions: 2,690, joins: 0/2,025}
Result size of Desugar (after optimization)
  = {terms: 9,267, types: 8,868, coercions: 3,242, joins: 0/681}
!!! Desugar [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]: finished in 228.74 milliseconds, allocated 230.869 megabytes
*** Simplifier [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]:
Result size of Simplifier iteration=1
  = {terms: 16,094, types: 20,625, coercions: 48,938, joins: 0/2,476}
Result size of Simplifier iteration=2
  = {terms: 14,906, types: 19,883, coercions: 23,744, joins: 0/2,040}
Result size of Simplifier
  = {terms: 14,906, types: 19,883, coercions: 23,744, joins: 0/2,040}
!!! Simplifier [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]: finished in 344.42 milliseconds, allocated 302.705 megabytes
*** Specialise [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]:
Result size of Specialise
  = {terms: 481,287,
     types: 678,953,
     coercions: 678,040,
     joins: 0/60,122}
!!! Specialise [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]: finished in 1514.55 milliseconds, allocated 1334.112 megabytes
*** Float out(FOS {Lam = Just 0,
                   Consts = True,
                   OverSatApps = False}) [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]:
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              OverSatApps = False})
  = {terms: 647,986,
     types: 1,024,213,
     coercions: 678,040,
     joins: 0/11,379}
!!! Float out(FOS {Lam = Just 0,
                   Consts = True,
                   OverSatApps = False}) [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]: finished in 5841.63 milliseconds, allocated 3220.974 megabytes
*** Simplifier [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]:
Result size of Simplifier iteration=1
  = {terms: 875,591,
     types: 1,730,333,
     coercions: 1,606,978,
     joins: 1,687/38,485}
Result size of Simplifier iteration=2
  = {terms: 590,375,
     types: 1,028,455,
     coercions: 935,739,
     joins: 1,790/14,760}
Result size of Simplifier iteration=3
  = {terms: 591,199,
     types: 1,023,693,
     coercions: 932,741,
     joins: 1,388/14,282}
Result size of Simplifier iteration=4
  = {terms: 591,191,
     types: 1,023,662,
     coercions: 932,706,
     joins: 1,388/14,282}
Result size of Simplifier
  = {terms: 591,191,
     types: 1,023,662,
     coercions: 932,706,
     joins: 1,388/14,282}
!!! Simplifier [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]: finished in 31563.79 milliseconds, allocated 19407.013 megabytes
*** Simplifier [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]:
Result size of Simplifier iteration=1
  = {terms: 615,440,
     types: 1,060,259,
     coercions: 995,409,
     joins: 1,790/19,007}
Result size of Simplifier iteration=2
  = {terms: 601,764,
     types: 999,776,
     coercions: 967,336,
     joins: 1,426/17,888}
Result size of Simplifier iteration=3
  = {terms: 604,759,
     types: 1,008,708,
     coercions: 964,576,
     joins: 1,430/18,968}
Result size of Simplifier iteration=4
  = {terms: 605,547,
     types: 1,014,972,
     coercions: 964,576,
     joins: 1,390/19,651}
Result size of Simplifier
  = {terms: 605,547,
     types: 1,014,972,
     coercions: 964,576,
     joins: 1,390/19,651}
!!! Simplifier [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]: finished in 25820.76 milliseconds, allocated 16303.192 megabytes
*** Simplifier [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]:
Result size of Simplifier iteration=1
  = {terms: 605,947,
     types: 1,015,795,
     coercions: 964,751,
     joins: 1,388/19,699}
Result size of Simplifier iteration=2
  = {terms: 605,972,
     types: 1,016,245,
     coercions: 964,551,
     joins: 1,388/19,724}
Result size of Simplifier iteration=3
  = {terms: 605,797,
     types: 1,015,645,
     coercions: 964,276,
     joins: 1,388/19,674}
Result size of Simplifier
  = {terms: 605,797,
     types: 1,015,645,
     coercions: 964,276,
     joins: 1,413/19,674}
!!! Simplifier [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]: finished in 25419.31 milliseconds, allocated 14782.031 megabytes
*** Float inwards [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]:
Result size of Float inwards
  = {terms: 605,797,
     types: 1,015,645,
     coercions: 964,276,
     joins: 1,413/19,674}
!!! Float inwards [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]: finished in 6112.94 milliseconds, allocated 16185.412 megabytes
*** Called arity analysis [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]:
Result size of Called arity analysis
  = {terms: 605,797,
     types: 1,015,645,
     coercions: 964,276,
     joins: 1,413/19,674}
!!! Called arity analysis [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]: finished in 14633.10 milliseconds, allocated 8115.772 megabytes
*** Simplifier [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]:
Result size of Simplifier iteration=1
  = {terms: 605,880,
     types: 1,015,751,
     coercions: 964,497,
     joins: 1,495/19,715}
Result size of Simplifier iteration=2
  = {terms: 605,880,
     types: 1,016,203,
     coercions: 964,206,
     joins: 1,495/19,715}
Result size of Simplifier
  = {terms: 605,880,
     types: 1,016,203,
     coercions: 964,206,
     joins: 1,495/19,715}
!!! Simplifier [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]: finished in 18966.03 milliseconds, allocated 11089.410 megabytes
*** Demand analysis [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]:
Result size of Demand analysis
  = {terms: 605,880,
     types: 1,016,203,
     coercions: 964,206,
     joins: 1,495/19,715}
!!! Demand analysis [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]: finished in 2936.09 milliseconds, allocated 2799.346 megabytes
*** Worker Wrapper binds [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]:
Result size of Worker Wrapper binds
  = {terms: 665,232,
     types: 1,119,155,
     coercions: 990,264,
     joins: 1,590/27,502}
!!! Worker Wrapper binds [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]: finished in 451.66 milliseconds, allocated 174.781 megabytes
*** Simplifier [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]:
Result size of Simplifier iteration=1
  = {terms: 646,033,
     types: 1,087,190,
     coercions: 1,000,774,
     joins: 1,840/23,613}
Result size of Simplifier iteration=2
  = {terms: 632,449,
     types: 1,058,998,
     coercions: 982,227,
     joins: 1,704/19,931}
Result size of Simplifier iteration=3
  = {terms: 626,136,
     types: 1,050,564,
     coercions: 977,184,
     joins: 1,503/19,534}
Result size of Simplifier
  = {terms: 626,136,
     types: 1,050,564,
     coercions: 977,184,
     joins: 1,503/19,534}
!!! Simplifier [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]: finished in 28679.14 milliseconds, allocated 18438.441 megabytes
*** Exitification transformation [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]:
Result size of Exitification transformation
  = {terms: 626,300,
     types: 1,051,170,
     coercions: 977,184,
     joins: 1,544/19,575}
!!! Exitification transformation [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]: finished in 229.39 milliseconds, allocated 206.637 megabytes
*** Float out(FOS {Lam = Just 0,
                   Consts = True,
                   OverSatApps = True}) [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]:
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              OverSatApps = True})
  = {terms: 634,505,
     types: 1,086,893,
     coercions: 977,184,
     joins: 1,519/13,137}
!!! Float out(FOS {Lam = Just 0,
                   Consts = True,
                   OverSatApps = True}) [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]: finished in 6735.63 milliseconds, allocated 3168.591 megabytes
*** Common sub-expression [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]:
Result size of Common sub-expression
  = {terms: 444,361,
     types: 842,923,
     coercions: 678,617,
     joins: 1,262/12,872}
!!! Common sub-expression [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]: finished in 1537.14 milliseconds, allocated 1019.939 megabytes
*** Float inwards [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]:
Result size of Float inwards
  = {terms: 444,355,
     types: 842,908,
     coercions: 678,617,
     joins: 1,262/12,869}
!!! Float inwards [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]: finished in 1880.23 milliseconds, allocated 4244.481 megabytes
*** Liberate case [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]:
Result size of Liberate case
  = {terms: 510,856,
     types: 961,100,
     coercions: 690,911,
     joins: 1,932/13,539}
!!! Liberate case [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]: finished in 277.10 milliseconds, allocated 254.985 megabytes
*** Simplifier [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]:
Result size of Simplifier iteration=1
  = {terms: 336,614,
     types: 739,458,
     coercions: 692,160,
     joins: 1,401/13,011}
Result size of Simplifier iteration=2
  = {terms: 336,601,
     types: 739,418,
     coercions: 691,260,
     joins: 1,401/13,007}
Result size of Simplifier
  = {terms: 336,601,
     types: 739,418,
     coercions: 691,260,
     joins: 1,401/13,007}
!!! Simplifier [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]: finished in 9392.56 milliseconds, allocated 7216.821 megabytes
*** SpecConstr [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]:
Result size of SpecConstr
  = {terms: 344,196,
     types: 751,125,
     coercions: 692,518,
     joins: 1,450/13,063}
!!! SpecConstr [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]: finished in 882.05 milliseconds, allocated 512.742 megabytes
*** Common sub-expression [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]:
Result size of Common sub-expression
  = {terms: 342,872,
     types: 748,896,
     coercions: 687,160,
     joins: 1,449/13,061}
!!! Common sub-expression [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]: finished in 2583.39 milliseconds, allocated 589.849 megabytes
*** Simplifier [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]:
Result size of Simplifier iteration=1
  = {terms: 342,046,
     types: 747,666,
     coercions: 687,173,
     joins: 1,449/13,067}
Result size of Simplifier iteration=2
  = {terms: 334,360,
     types: 736,001,
     coercions: 685,941,
     joins: 1,400/13,005}
Result size of Simplifier
  = {terms: 334,360,
     types: 736,001,
     coercions: 685,941,
     joins: 1,400/13,005}
!!! Simplifier [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]: finished in 9174.42 milliseconds, allocated 7109.842 megabytes
*** Demand analysis [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]:
Result size of Demand analysis
  = {terms: 334,360,
     types: 736,001,
     coercions: 685,941,
     joins: 1,400/13,005}
!!! Demand analysis [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]: finished in 2110.15 milliseconds, allocated 2054.269 megabytes
*** CoreTidy [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]:
Result size of Tidy Core
  = {terms: 316,819,
     types: 697,238,
     coercions: 670,355,
     joins: 1,296/12,853}
!!! CoreTidy [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]: finished in 3023.85 milliseconds, allocated 629.785 megabytes
writeBinIface: 501 Names
writeBinIface: 1112 dict entries
writeBinIface: 501 Names
writeBinIface: 1112 dict entries
*** CorePrep [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]:
Result size of CorePrep
  = {terms: 370,317,
     types: 765,281,
     coercions: 670,355,
     joins: 1,296/32,471}
!!! CorePrep [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]: finished in 1115.43 milliseconds, allocated 875.279 megabytes
*** Stg2Stg:
*** CodeGen [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]:
!!! CodeGen [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]: finished in 13447.09 milliseconds, allocated 15814.518 megabytes
*** Assembler:
gcc -fno-stack-protector -DTABLES_NEXT_TO_CODE -iquotesrc/Data/Array/Accelerate/Test/NoFib/Prelude -I.stack-work/dist/x86_64-osx/Cabal-2.4.0.1/build -I.stack-work/dist/x86_64-osx/Cabal-2.4.0.1/build -I.stack-work/dist/x86_64-osx/Cabal-2.4.0.1/build/autogen -I.stack-work/dist/x86_64-osx/Cabal-2.4.0.1/build/global-autogen -I.stack-work/dist/x86_64-osx/Cabal-2.4.0.1/build -fno-common -U__PIC__ -D__PIC__ -Qunused-arguments -x assembler -c /var/folders/0y/t7qfzn8j1y7dlfhjkvry2_hh0000gn/T/ghc85008_0/ghc_414.s -o .stack-work/dist/x86_64-osx/Cabal-2.4.0.1/build/Data/Array/Accelerate/Test/NoFib/Prelude/ZipWith.o
*** CorePrep [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]:
Result size of CorePrep
  = {terms: 370,317,
     types: 765,281,
     coercions: 670,355,
     joins: 1,296/32,471}
!!! CorePrep [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]: finished in 1116.92 milliseconds, allocated 860.074 megabytes
*** Stg2Stg:
*** CodeGen [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]:
!!! CodeGen [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]: finished in 16190.13 milliseconds, allocated 16366.186 megabytes
*** Assembler:
gcc -fno-stack-protector -DTABLES_NEXT_TO_CODE -iquotesrc/Data/Array/Accelerate/Test/NoFib/Prelude -I.stack-work/dist/x86_64-osx/Cabal-2.4.0.1/build -I.stack-work/dist/x86_64-osx/Cabal-2.4.0.1/build -I.stack-work/dist/x86_64-osx/Cabal-2.4.0.1/build/autogen -I.stack-work/dist/x86_64-osx/Cabal-2.4.0.1/build/global-autogen -I.stack-work/dist/x86_64-osx/Cabal-2.4.0.1/build -fno-common -U__PIC__ -D__PIC__ -Qunused-arguments -x assembler -c /var/folders/0y/t7qfzn8j1y7dlfhjkvry2_hh0000gn/T/ghc85008_0/ghc_416.s -o .stack-work/dist/x86_64-osx/Cabal-2.4.0.1/build/Data/Array/Accelerate/Test/NoFib/Prelude/ZipWith.dyn_o
*** Deleting temp files:
Deleting: /var/folders/0y/t7qfzn8j1y7dlfhjkvry2_hh0000gn/T/ghc85008_0/ghc_413.s /var/folders/0y/t7qfzn8j1y7dlfhjkvry2_hh0000gn/T/ghc85008_0/ghc_414.s /var/folders/0y/t7qfzn8j1y7dlfhjkvry2_hh0000gn/T/ghc85008_0/ghc_415.c /var/folders/0y/t7qfzn8j1y7dlfhjkvry2_hh0000gn/T/ghc85008_0/ghc_416.s /var/folders/0y/t7qfzn8j1y7dlfhjkvry2_hh0000gn/T/ghc85008_0/ghc_417.c
Warning: deleting non-existent /var/folders/0y/t7qfzn8j1y7dlfhjkvry2_hh0000gn/T/ghc85008_0/ghc_413.s
Warning: deleting non-existent /var/folders/0y/t7qfzn8j1y7dlfhjkvry2_hh0000gn/T/ghc85008_0/ghc_415.c
Warning: deleting non-existent /var/folders/0y/t7qfzn8j1y7dlfhjkvry2_hh0000gn/T/ghc85008_0/ghc_417.c

@tdammers
Copy link

Here is the ghc-8.6.1 -v3 log for ..NoFib.Prelude.ZipWith

Great, thanks for that! This one looks like there could actually be more going on than just literally obeying INLINE pragmas - there's an awful lot of joins too, which could hint at some other optimization triggering the blowup.

@tmcdonell tmcdonell changed the title slow compilation with ghc-8.4.3 Slow compilation Aug 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants