
callsite union splitting #17212

Merged: 3 commits merged into master on Jul 2, 2016

Conversation

@vtjnash (Member) commented Jun 30, 2016

This provides a faster path for semi-type-stable calls by splitting the call site over the small set of possible argument types.

For example, the usual test code:

julia> begin
       @noinline f(a,b) = b
       @noinline function unstable_no_alloc(n::Int)
           a = Ref(1)
           b = Ref(1.0)
           for s = 1:n
               a = f(a, b)
           end
           a
       end

       function main()
           @time unstable_no_alloc(4_600_000_000)
       end
       end
main (generic function with 1 method)

julia> main()
 39.983839 seconds (12 allocations: 736 bytes)
Base.RefValue{Float64}(1.0)

(vs. the type-stable version which takes 12.771225 sec)
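
Conceptually, the optimization replaces the dynamically-dispatched f(a, b) in the loop with an explicit branch over the members of the inferred Union, so each branch can call a known method instance directly. A hand-written sketch of the equivalent surface code (the real transform operates on the lowered IR):

# Sketch of the transform applied to `a = f(a, b)` above, where `a` is
# inferred as Union{Base.RefValue{Int64}, Base.RefValue{Float64}}.
# Each isa branch narrows the type, so the call in that branch resolves
# to a specific method instance instead of going through dynamic dispatch.
if isa(a, Base.RefValue{Int64})
    a = f(a::Base.RefValue{Int64}, b)
elseif isa(a, Base.RefValue{Float64})
    a = f(a::Base.RefValue{Float64}, b)
else
    error("error in type inference due to #265")  # unreachable fallback, as in the emitted IR
end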

@yuyichao (Contributor)

Sigh, now I need something else to actually test dynamic dispatch.

Anyway, 👍

@yuyichao (Contributor)

Looks like the getfield should be assigned to an SSAValue first in this case, to avoid multiple undef ref checks?

julia> @noinline f(x) = x
f (generic function with 1 method)

julia> @noinline g(x) = f(x[])
g (generic function with 1 method)

julia> @code_warntype g(Ref{Union{Int,Float64}}())
Variables:
  #self#::#g
  x::Base.RefValue{Union{Float64,Int64}}
  #temp#::LambdaInfo

Body:
  begin  # REPL[2], line 1:
      unless (Core.isa)((Core.getfield)(x::Base.RefValue{Union{Float64,Int64}},:x)::Union{Float64,Int64},Float64)::Any goto 6
      #temp#::LambdaInfo = LambdaInfo for f(::Float64)
      goto 14
      6: 
      unless (Core.isa)((Core.getfield)(x::Base.RefValue{Union{Float64,Int64}},:x)::Union{Float64,Int64},Int64)::Any goto 10
      #temp#::LambdaInfo = LambdaInfo for f(::Int64)
      goto 14
      10: 
      goto 12
      12: 
      (Base.error)("error in type inference due to #265")::Any
      14: 
      SSAValue(0) = $(Expr(:invoke, :(#temp#), :(Main.f), :((Core.getfield)(x,:x)::Union{Float64,Int64})))
      return SSAValue(0)
  end::Union{Float64,Int64}
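
In surface-Julia terms, the concern is roughly the difference between these two shapes (a hand-written sketch, not the actual lowered IR):

# Current lowering, per the output above: the field load x[] (i.e.
# (Core.getfield)(x, :x)) is repeated in each isa test and again at the
# call, so its undef-ref check runs multiple times.
if isa(x[], Float64)
    f(x[]::Float64)
elseif isa(x[], Int64)
    f(x[]::Int64)
end

# Suggested shape: load the field once into a local (an SSAValue in the
# IR) and branch on that, so the load and undef check happen only once.
let v = x[]
    if isa(v, Float64)
        f(v::Float64)
    elseif isa(v, Int64)
        f(v::Int64)
    end
end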

@vtjnash (Member, Author) commented Jun 30, 2016

Yes, but that's a bug in effect_free, unrelated to this PR.

@vtjnash (Member, Author) commented Jun 30, 2016

This also seems to do a nice job slaughtering a few of the cases in #16128 (although perhaps more as a result of "disable gcframe rooting on return path of jlcall wrapper" than of the call-site splitting).

append!(stmts, match)
if error_label !== nothing
    push!(stmts, error_label)
    push!(stmts, Expr(:call, GlobalRef(_topmod(sv.mod), :error), "error in type inference due to #265"))
Review comment (Member): Should this say "...issue #265"? (As written, "#265" alone is effectively jargon.)

Review comment (Contributor): was not addressed.

@timholy (Member) commented Jun 30, 2016

Not sure I understand the code terribly well, but from the look of it, once there are more than 16 types it throws out the oldest? Are there any circumstances where this will cause a performance degradation?

@JeffBezanson (Member)

Really cool. For me the benchmark in the OP goes from 69 seconds to 32 seconds.

  • Could the code for this not be inside inlineable? That function is just a beast.
  • I would expect it to be faster to emit multiple direct invoke sites instead of assigning a LambdaInfo to a variable and indirectly calling it.

@timholy (Member) commented Jun 30, 2016

As another possible benchmark, here's some code that several of us were playing with at JuliaCon (CC @simonbyrne):

using BenchmarkTools

@noinline unstable(x) = x < 0.5 ? 1 : 1.0
@noinline stable(x) = x < 0.5 ? 1.0 : 1.0

@noinline convF64(x) = convert(Float64, x)  # or try @inline, both are interesting

function mysum(A)
    s = 0.0
    for a in A
        s += convF64(unstable(a))
    end
    s
end

function mysum_md(A)  # md=manual dispatch
    s = 0.0
    for a in A
        x = unstable(a)
        if isa(x, Int)
            s += convF64(x::Int)
        elseif isa(x, Float64)
            s += convF64(x::Float64)
        else
            error("oops")
        end
    end
    s
end

function mysum_stable(A)
    s = 0.0
    for a in A
        s += convF64(stable(a))
    end
    s
end

r = rand(10^4)
@show @benchmark mysum(r)
@show @benchmark mysum_md(r)
@show @benchmark mysum_stable(r)

On a recent master build:

@benchmark(mysum(r)) = BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  156.27 kb
  allocs estimate:  10001
  minimum time:     288.63 μs (0.00% GC)
  median time:      311.28 μs (0.00% GC)
  mean time:        326.34 μs (1.54% GC)
  maximum time:     1.83 ms (73.83% GC)
@benchmark(mysum_md(r)) = BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  16.00 bytes
  allocs estimate:  1
  minimum time:     102.12 μs (0.00% GC)
  median time:      110.34 μs (0.00% GC)
  mean time:        115.89 μs (0.00% GC)
  maximum time:     175.82 μs (0.00% GC)
@benchmark(mysum_stable(r)) = BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  16.00 bytes
  allocs estimate:  1
  minimum time:     36.88 μs (0.00% GC)
  median time:      39.52 μs (0.00% GC)
  mean time:        40.16 μs (0.00% GC)
  maximum time:     55.39 μs (0.00% GC)

So the gap between type-stable and type-unstable is roughly 8x (this is already better than it used to be), and the annotated version gets you to within 2.5x of the type-stable code (and eliminates the memory allocation, which I didn't expect). Would be interesting to know whether this PR gets pretty close to that without the need to enumerate the possible types (which is a huge advantage).

@vtjnash (Member, Author) commented Jul 1, 2016

> I would expect it to be faster to emit multiple direct invoke sites instead of assigning a LambdaInfo to a variable and indirectly calling it.

The indirect call cost should be fairly minimal, and it saves us the work of linearizing the IR here. It also allows us to generate much more compact LLVM code, since we only have one call site, rather than jlcall frames, etc.

For small enough branching factors, I expect that tradeoff would be worthwhile; this approach is more general, handling moderate branching factors as well.
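
For concreteness, the two shapes under discussion, approximated in surface Julia (a sketch only; in the real transform, shape A stores a LambdaInfo and calls it via :invoke, and the closures below are merely stand-ins for that):

@noinline f(x) = x

# Shape A (this PR, approximated): select the callee once, then make a
# single indirect call through a variable -- one call site in the code.
function shape_a(v)
    callee = isa(v, Float64) ? (x -> f(x::Float64)) : (x -> f(x::Int64))
    return callee(v)            # one indirect call site
end

# Shape B (the suggestion above, approximated): a direct call in each
# branch -- cheaper per call, but it duplicates the call site and
# requires linearizing the IR to merge the results.
function shape_b(v)
    if isa(v, Float64)
        return f(v::Float64)    # resolves to f(::Float64) directly
    else
        return f(v::Int64)      # resolves to f(::Int64) directly
    end
end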

@vtjnash force-pushed the jn/invoke-union-splitting branch from d2646e1 to 316bfdf on July 1, 2016 18:55.
@vtjnash merged commit 4823351 into master on Jul 2, 2016.
@vtjnash deleted the jn/invoke-union-splitting branch on July 2, 2016 23:58.
@timholy (Member) commented Jul 3, 2016

I looked at the test case above. mysum(r) drops to 200μs, better (yay!), but not as good as the manual dispatch version. The memory allocation is also still there.

@simonbyrne (Contributor)

This is pretty cool. My naive question: what is the manual dispatch version now doing that the default version is not? And what would it take to replicate that?

@vtjnash (Member, Author) commented Jul 22, 2016

They use a different representation. This version allows us to generate smaller code, although it doesn't perform as well on synthetic benchmarks at present. I'm expecting Oscar will make a couple PRs soon enough that will close the performance gap.
