
callsite union splitting #17212

Merged: 3 commits merged into master on Jul 2, 2016

Conversation

@vtjnash (Member) commented Jun 30, 2016

This provides a faster path for semi-type-stable calls by splitting the call site over the small set of possible argument types.

For example, the usual test code:

julia> begin
       @noinline f(a,b) = b
       @noinline function unstable_no_alloc(n::Int)
           a = Ref(1)
           b = Ref(1.0)
           for s = 1:n
               a = f(a, b)
           end
           a
       end

       function main()
           @time unstable_no_alloc(4_600_000_000)
       end
       end
main (generic function with 1 method)

julia> main()
 39.983839 seconds (12 allocations: 736 bytes)
Base.RefValue{Float64}(1.0)

(vs. the type-stable version which takes 12.771225 sec)
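
Conceptually, the optimization replaces the dynamically-dispatched f(a, b) in the loop with an explicit branch over the members of the inferred Union, so each branch can call a known method instance directly. A hand-written sketch of the equivalent surface code (the real transform operates on the lowered IR):

# Sketch of the transform applied to `a = f(a, b)` above, where `a` is
# inferred as Union{Base.RefValue{Int64}, Base.RefValue{Float64}}.
# Each isa branch narrows the type, so the call in that branch resolves
# to a specific method instance instead of going through dynamic dispatch.
if isa(a, Base.RefValue{Int64})
    a = f(a::Base.RefValue{Int64}, b)
elseif isa(a, Base.RefValue{Float64})
    a = f(a::Base.RefValue{Float64}, b)
else
    error("error in type inference due to #265")  # unreachable fallback, as in the emitted IR
end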

@yuyichao (Contributor)

Sigh, now I need something else to actually test dynamic dispatch.

Anyway, 👍

@yuyichao (Contributor)

Looks like the getfield should be assigned to an SSAValue first in this case, to avoid multiple undef ref checks?

julia> @noinline f(x) = x
f (generic function with 1 method)

julia> @noinline g(x) = f(x[])
g (generic function with 1 method)

julia> @code_warntype g(Ref{Union{Int,Float64}}())
Variables:
  #self#::#g
  x::Base.RefValue{Union{Float64,Int64}}
  #temp#::LambdaInfo

Body:
  begin  # REPL[2], line 1:
      unless (Core.isa)((Core.getfield)(x::Base.RefValue{Union{Float64,Int64}},:x)::Union{Float64,Int64},Float64)::Any goto 6
      #temp#::LambdaInfo = LambdaInfo for f(::Float64)
      goto 14
      6: 
      unless (Core.isa)((Core.getfield)(x::Base.RefValue{Union{Float64,Int64}},:x)::Union{Float64,Int64},Int64)::Any goto 10
      #temp#::LambdaInfo = LambdaInfo for f(::Int64)
      goto 14
      10: 
      goto 12
      12: 
      (Base.error)("error in type inference due to #265")::Any
      14: 
      SSAValue(0) = $(Expr(:invoke, :(#temp#), :(Main.f), :((Core.getfield)(x,:x)::Union{Float64,Int64})))
      return SSAValue(0)
  end::Union{Float64,Int64}
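
In surface-Julia terms, the concern is roughly the difference between these two shapes (a hand-written sketch, not the actual lowered IR):

# Current lowering, per the output above: the field load x[] (i.e.
# (Core.getfield)(x, :x)) is repeated in each isa test and again at the
# call, so its undef-ref check runs multiple times.
if isa(x[], Float64)
    f(x[]::Float64)
elseif isa(x[], Int64)
    f(x[]::Int64)
end

# Suggested shape: load the field once into a local (an SSAValue in the
# IR) and branch on that, so the load and undef check happen only once.
let v = x[]
    if isa(v, Float64)
        f(v::Float64)
    elseif isa(v, Int64)
        f(v::Int64)
    end
end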

@vtjnash (Member, Author) commented Jun 30, 2016

Yes, but that's a bug in effect_free, unrelated to this PR.

@vtjnash (Member, Author) commented Jun 30, 2016

This also seems to do a nice job slaughtering a few of the cases in #16128 (although perhaps more as a result of "disable gcframe rooting on return path of jlcall wrapper" than of the call-site splitting).

append!(stmts, match)
if error_label !== nothing
    push!(stmts, error_label)
    push!(stmts, Expr(:call, GlobalRef(_topmod(sv.mod), :error), "error in type inference due to #265"))
Review comment (Member): Should this say "...issue #265"? (As written, "#265" alone is effectively jargon.)

Review comment (Contributor): was not addressed.

@timholy (Member) commented Jun 30, 2016

Not sure I understand the code terribly well, but from the look of it, once there are more than 16 types it throws out the oldest? Are there any circumstances where this will cause a performance degradation?

@JeffBezanson (Member)

Really cool. For me the benchmark in the OP goes from 69 seconds to 32 seconds.

  • Could the code for this not be inside inlineable? That function is just a beast.
  • I would expect it to be faster to emit multiple direct invoke sites instead of assigning a LambdaInfo to a variable and indirectly calling it.

@timholy (Member) commented Jun 30, 2016

As another possible benchmark, here's some code that several of us were playing with at JuliaCon (CC @simonbyrne):

using BenchmarkTools

@noinline unstable(x) = x < 0.5 ? 1 : 1.0
@noinline stable(x) = x < 0.5 ? 1.0 : 1.0

@noinline convF64(x) = convert(Float64, x)  # or try @inline, both are interesting

function mysum(A)
    s = 0.0
    for a in A
        s += convF64(unstable(a))
    end
    s
end

function mysum_md(A)  # md=manual dispatch
    s = 0.0
    for a in A
        x = unstable(a)
        if isa(x, Int)
            s += convF64(x::Int)
        elseif isa(x, Float64)
            s += convF64(x::Float64)
        else
            error("oops")
        end
    end
    s
end

function mysum_stable(A)
    s = 0.0
    for a in A
        s += convF64(stable(a))
    end
    s
end

r = rand(10^4)
@show @benchmark mysum(r)
@show @benchmark mysum_md(r)
@show @benchmark mysum_stable(r)

On a recent master build:

@benchmark(mysum(r)) = BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  156.27 kb
  allocs estimate:  10001
  minimum time:     288.63 μs (0.00% GC)
  median time:      311.28 μs (0.00% GC)
  mean time:        326.34 μs (1.54% GC)
  maximum time:     1.83 ms (73.83% GC)
@benchmark(mysum_md(r)) = BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  16.00 bytes
  allocs estimate:  1
  minimum time:     102.12 μs (0.00% GC)
  median time:      110.34 μs (0.00% GC)
  mean time:        115.89 μs (0.00% GC)
  maximum time:     175.82 μs (0.00% GC)
@benchmark(mysum_stable(r)) = BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  16.00 bytes
  allocs estimate:  1
  minimum time:     36.88 μs (0.00% GC)
  median time:      39.52 μs (0.00% GC)
  mean time:        40.16 μs (0.00% GC)
  maximum time:     55.39 μs (0.00% GC)

So the gap between type-stable and type-unstable is roughly 8x (this is already better than it used to be), and the annotated version gets you to within 2.5x of the type-stable code (and eliminates the memory allocation, which I didn't expect). Would be interesting to know whether this PR gets pretty close to that without the need to enumerate the possible types (which is a huge advantage).

@vtjnash (Member, Author) commented Jul 1, 2016

> I would expect it to be faster to emit multiple direct invoke sites instead of assigning a LambdaInfo to a variable and indirectly calling it.

The indirect call cost should be fairly minimal, and it saves us the work of linearizing the IR here. It also allows us to generate much more compact LLVM code, since we only have one call site, rather than jlcall frames, etc.

For small enough branching factors, I expect that tradeoff would be worthwhile; this approach is more general, handling moderate branching factors as well.
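
For concreteness, the two shapes under discussion, approximated in surface Julia (a sketch only; in the real transform, shape A stores a LambdaInfo and calls it via :invoke, and the closures below are merely stand-ins for that):

@noinline f(x) = x

# Shape A (this PR, approximated): select the callee once, then make a
# single indirect call through a variable -- one call site in the code.
function shape_a(v)
    callee = isa(v, Float64) ? (x -> f(x::Float64)) : (x -> f(x::Int64))
    return callee(v)            # one indirect call site
end

# Shape B (the suggestion above, approximated): a direct call in each
# branch -- cheaper per call, but it duplicates the call site and
# requires linearizing the IR to merge the results.
function shape_b(v)
    if isa(v, Float64)
        return f(v::Float64)    # resolves to f(::Float64) directly
    else
        return f(v::Int64)      # resolves to f(::Int64) directly
    end
end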

@vtjnash force-pushed the jn/invoke-union-splitting branch from d2646e1 to 316bfdf on July 1, 2016 18:55.
@vtjnash merged commit 4823351 into master on Jul 2, 2016.
@vtjnash deleted the jn/invoke-union-splitting branch on July 2, 2016 23:58.
@timholy (Member) commented Jul 3, 2016

I looked at the test case above. mysum(r) drops to 200μs, better (yay!), but not as good as the manual dispatch version. The memory allocation is also still there.

@simonbyrne (Contributor)

This is pretty cool. My naive question: what is the manual dispatch version now doing that the default version is not? And what would it take to replicate that?

@vtjnash (Member, Author) commented Jul 22, 2016

They use a different representation. This version allows us to generate smaller code, although it doesn't perform as well on synthetic benchmarks at present. I'm expecting Oscar will make a couple PRs soon enough that will close the performance gap.
