
Add onepass algorithm for logsumexp #97

Merged: 14 commits into JuliaStats:master, Sep 23, 2020

Conversation

@devmotion (Member) commented Jun 8, 2020

This PR fixes #91.

I recently used the streaming algorithm for computing logsumexp described in http://www.nowozin.net/sebastian/blog/streaming-log-sum-exp-computation.html in one of my projects, and I noticed that there is already an open issue about adding it to StatsFuns.
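
For context, here is a minimal sketch of that streaming algorithm (the function name is hypothetical, and the PR's actual implementation differs in details such as type handling):

function streaming_logsumexp(X)  # hypothetical name, for illustration only
    xmax = -Inf  # running maximum of the elements seen so far
    r = 0.0      # running sum of exp(x - xmax)
    for x in X
        if x <= xmax
            r += exp(x - xmax)
        else
            # new maximum: rescale the accumulated sum before absorbing x
            r = r * exp(xmax - x) + 1
            xmax = x
        end
    end
    return xmax + log(r)
end

Each element costs a single exp, and shifting by the running maximum avoids overflow.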

I reran the benchmark mentioned in the issue and got:

julia> using StatsFuns, BenchmarkTools, Random

julia> Random.seed!(100);

julia> n = 10_000_000;

julia> X = 500 .* randn(n);

julia> @btime logsumexp($X)
  94.999 ms (0 allocations: 0 bytes)
2697.6523838563817

julia> @btime StatsFuns.logsumexp_onepass($X)
  32.815 ms (0 allocations: 0 bytes)
2697.6523838563817

julia> f(X) = reduce(logaddexp, X)
f (generic function with 1 method)

julia> @btime f($X)
  162.519 ms (0 allocations: 0 bytes)
2697.6523838563817

Currently logsumexp_onepass is used as the fallback for logsumexp but is not exported. I'm not sure if it should be exported as well. Additionally, maybe it would be beneficial to use the one-pass algorithm even for arrays above a certain size, since it seems to perform better in the benchmark above. Some additional benchmarks suggest that this might be the case even for fairly small arrays:

julia> n = 10;

julia> Random.seed!(100);

julia> X = 500 .* randn(n);

julia> @btime logsumexp($X)
  84.612 ns (0 allocations: 0 bytes)
1016.2819596629249

julia> @btime StatsFuns.logsumexp_onepass($X)
  87.535 ns (0 allocations: 0 bytes)
1016.2819596629249

julia> @btime f($X)
  184.664 ns (0 allocations: 0 bytes)
1016.2819596629249

julia> n = 100;

julia> Random.seed!(100);

julia> X = 500 .* randn(n);

julia> @btime logsumexp($X)
  633.000 ns (0 allocations: 0 bytes)
1786.3178411042175

julia> @btime StatsFuns.logsumexp_onepass($X)
  654.025 ns (0 allocations: 0 bytes)
1786.3178411042175

julia> @btime f($X)
  1.337 μs (0 allocations: 0 bytes)
1786.3178411042175

julia> n = 1_000;

julia> Random.seed!(100);

julia> X = 500 .* randn(n);

julia> @btime logsumexp($X)
  6.698 μs (0 allocations: 0 bytes)
1786.3178411042175

julia> @btime StatsFuns.logsumexp_onepass($X)
  3.854 μs (0 allocations: 0 bytes)
1786.3178411042175

julia> @btime f($X)
  11.460 μs (0 allocations: 0 bytes)
1786.3178411042175

Edit: This PR also fixes the example in #63:

julia> logsumexp(i for i in range(-500, stop = 10, length = 1000) if true) - logsumexp(range(-500, stop = 10, length = 1000))
0.0

Additionally, it addresses (and fixes?) #75 (but maybe we would still want the two-pass algorithm instead?).

@ararslan (Member) commented Jun 8, 2020

Seems like we could just use this algorithm for logsumexp instead of what's currently there without needing a separate function for it. The benchmarks you posted seem to suggest that would be beneficial; any reason not to do that?

@devmotion (Member, Author) commented Jun 8, 2020

The algorithm doesn't handle reductions along specific dimensions, so I guess it would only replace the dims = : case in a straightforward way. Additionally, for n = 100 elements the two-pass algorithm is still slightly faster (the one-pass algorithm seems to take over somewhere between 100 and 1000 elements in my benchmarks).
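
For comparison, the two-pass approach looks roughly like this (a sketch, not the exact StatsFuns code):

function twopass_logsumexp(X)
    xmax = maximum(X)  # pass 1: find the maximum
    # pass 2: sum the exponentials shifted by the maximum
    return xmax + log(sum(x -> exp(x - xmax), X))
end

Since the maximum must be known before the second pass starts, supporting dims reduces to a maximum(X; dims=dims) call followed by a shifted reduction, which the one-pass accumulator does not map onto as directly.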

@ararslan requested a review from @simonbyrne (June 8, 2020, 19:59)
@devmotion (Member, Author) commented:

Bump :)

@dmbates (Contributor) left a comment:

Looks good to me.

One thing I did notice in the checks, not specific to this PR, is that the package is only being tested on Julia 1.0, 1.3, and nightly. Perhaps the Travis and AppVeyor configurations should be updated.

@devmotion (Member, Author) commented:

I addressed this issue in a separate PR.

@dmbates (Contributor) commented Aug 19, 2020

> I addressed this issue in a separate PR.

Thank you.

@nalimilan (Member) commented Sep 1, 2020

AFAICT your algorithm is only very slightly slower for small arrays, and much faster for large arrays, right? So better to use it instead of the existing reduce(logaddexp, X). (We don't usually provide two functions with different algorithms doing the same thing.)

EDIT: I hadn't realized logsumexp_onepass wasn't exported. Then better not to add that function and instead move its code into the body of the logsumexp fallback method.

@nalimilan (Member) commented:

@simonbyrne wrote the previous implementation at #45, so I'd like him to confirm it's OK, in particular for custom arrays like CuArrays, where reduce might have been faster.

I also wonder why reduce(logaddexp, X) is so slow.

@mileslucas mentioned this pull request (Sep 1, 2020)
@devmotion (Member, Author) commented:

> I also wonder why reduce(logaddexp, X) is so slow.

reduce(logaddexp, X) calls log1pexp in every step, which in the example above should hit the log1p(exp(x)) branch in roughly half of the steps. That is surely more expensive than just calling exp(...) in every step (plus one additional log(...) call at the end).
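
For reference, the identities behind those calls look roughly like this (a sketch; StatsFuns' actual definitions use additional branches and carefully chosen cutoffs):

logaddexp(x, y) = max(x, y) + log1pexp(-abs(x - y))
log1pexp(x) = x < 18.0 ? log1p(exp(x)) : x + exp(-x)  # illustrative cutoff

So every reduce step pays for an abs, an exp, and a log1p (or a second exp), whereas the one-pass algorithm pays for a single exp per element.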

@devmotion (Member, Author) commented Sep 3, 2020

> @simonbyrne wrote the previous implementation at #45, so I'd like him to confirm it's OK, in particular for custom arrays like CuArrays, where reduce might have been faster.

Calling logsumexp on a CuArray does not fall back to the reduce version but rather to the version implemented for AbstractArray. In #45, the reduce version was just added as a generic fallback; it seems the main change enabling the use of CuArrays was removing the indexing from the AbstractArray implementation. Hence I'm not convinced that logsumexp_onepass should be used as a replacement for logsumexp(x::AbstractArray; dims = :).

The one-pass algorithm could be used as a replacement for the generic reduce fallback without defining an explicit logsumexp_onepass function, but then it would not be possible to use it with AbstractArrays, which would still dispatch to the existing two-pass method.

Therefore I created a separate logsumexp_onepass function that is used as the generic fallback of logsumexp.

Edit: If it is desired, we could dispatch logsumexp(x::Array; dims = :) (i.e., only for Arrays) to the logsumexp_onepass function but still keep the current implementation of logsumexp for general array types.

Edit 2: Alternatively, one could use the ArrayInterface.fast_scalar_indexing trait to determine whether indexing is fast, and only dispatch to logsumexp_onepass in that case.
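
Both edits might look roughly like this (hypothetical sketches, not code from this PR; _logsumexp_generic is a stand-in for the existing two-pass method, and fast_scalar_indexing is assumed to return a Bool):

# Edit 1: special-case Array only
logsumexp(x::Array{<:Real}) = logsumexp_onepass(x)

# Edit 2: trait-based dispatch via ArrayInterface
using ArrayInterface: fast_scalar_indexing

function logsumexp(x::AbstractArray{<:Real})
    return fast_scalar_indexing(typeof(x)) ? logsumexp_onepass(x) : _logsumexp_generic(x)
end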

@nalimilan (Member) commented:

Unfortunately, ArrayInterface currently depends on Requires, which increases the load time significantly and is therefore probably not OK for this package. We really need this kind of trait in a lightweight package containing only basic definitions. I guess in the meantime it would be OK to use logsumexp_onepass for Array only, and leave a TODO.

Though another more interesting solution would be to adapt the one-pass algorithm to use mapreduce instead. That way, the optimized CuArray methods (which distribute work in parallel over multiple blocks) would be used, and we would get the best of both worlds. AFAICT this should be possible, as the operation applied to each element is associative, right?

@devmotion (Member, Author) commented:

> Though another more interesting solution would be to adapt the one-pass algorithm to use mapreduce instead. That way, the optimized CuArray methods (which distribute work in parallel over multiple blocks) would be used, and we would get the best of both worlds. AFAICT this should be possible, as the operation applied to each element is associative, right?

Hmm, I don't see how this would be possible with mapreduce: the transformation applied to each element depends on the result of the reduction so far, so it is not clear to me how one could define a function of the current element only for the map part of mapreduce. AFAICT it is possible to rewrite the implementation using foldl or foldr. However, every reduction step results in a tuple of the current maximum and the sum of the exponentials of all elements so far (divided by the exponential of the largest element so far), so the reduction is not associative: for foldl the reduction has the signature f(::Tuple{<:Real,<:Real}, ::Real), whereas for foldr it is f(::Real, ::Tuple{<:Real,<:Real}). Hence one can't use reduce.
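
A foldl formulation of the above might look like this (a minimal sketch with the accumulator (running maximum, rescaled sum); not code from this PR):

function logsumexp_foldl(X)
    xmax, r = foldl(X; init = (-Inf, 0.0)) do (xmax, r), x
        if x <= xmax
            (xmax, r + exp(x - xmax))
        else
            # new maximum: rescale the accumulated sum
            (x, r * exp(xmax - x) + 1)
        end
    end
    return xmax + log(r)
end

The accumulator and the elements have different types, which is exactly why plain reduce does not apply.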

@nalimilan (Member) commented:

Yeah, the algorithm currently can only handle one new element at a time. But I wonder whether it would be possible to use a trick in order to keep track of whether the reduction operation is currently working on a single element or on a partial sum of multiple elements. Something like

mapreduce(x -> (x, 1), ((x, i), (y, j)) -> (i == 1 && j == 1 ? ... : (i == 1 ? ... : (j == 1 ? ... : ...)), i + j), v)

AFAICT that would work if we have a formula to combine two partial sums into a single one. Is that the case? :-)

If that doesn't work, let's go with the special-casing of Array.
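
For reference: if each partial state is (m, r) with r equal to the sum of exp(x - m) over the elements it covers, two states do combine into one as m = max(m1, m2) and r = r1 * exp(m1 - m) + r2 * exp(m2 - m), and this operation is associative.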

@devmotion (Member, Author) commented:

Yes, I guess that could work 👍 The reduction of two partial sums seems to be very similar to the currently implemented version AFAICT. I'll see if it works and benchmark both variants.

@devmotion (Member, Author) commented Sep 4, 2020

Hmm, it works mathematically, but it seems mapreduce is quite slow:

julia> using StatsFuns, BenchmarkTools, Random

julia> Random.seed!(100);

julia> n = 10_000_000;

julia> X = 500 .* randn(n);

julia> @btime logsumexp($X)
  106.007 ms (0 allocations: 0 bytes)
2570.6633157197443

julia> @btime StatsFuns.logsumexp_onepass($X)
  42.388 ms (0 allocations: 0 bytes)
2570.6633157197443

julia> @btime StatsFuns.logsumexp_onepass_reduce($X)
  74.562 ms (0 allocations: 0 bytes)
2570.6633157197443

julia> f(X) = reduce(logaddexp, X)
f (generic function with 1 method)

julia> @btime f($X)
  163.131 ms (0 allocations: 0 bytes)
2570.6633157197443

The implementation should be quite optimized and is very similar to that of logsumexp_onepass:

function logsumexp_onepass_reduction((xmax1, r1)::T, (xmax2, r2)::T) where {T<:Tuple}
    if xmax1 < xmax2
        xmax = xmax2
        r = r2 + r1 * exp(xmax1 - xmax2)
    elseif xmax1 > xmax2
        xmax = xmax1
        r = r1 + r2 * exp(xmax2 - xmax1)
    else # ensure finite values if x = xmax = ± Inf
        xmax = xmax1
        r = r1 + r2
    end

    return xmax, r
end

function logsumexp_onepass_reduce(X)
    isempty(X) && return log(sum(X))

    # map each element x to the partial state (x, 1) and combine
    # partial states with the reduction above
    xmax, r = mapreduce(logsumexp_onepass_reduction, X) do x
        x, float(one(x))
    end

    return xmax + log(r)
end

@nalimilan (Member) commented:

Ah, too bad. Though it's a bit faster than the current implementation. I wonder what would be the result for CuArray. mapreduce also has the potential of using multiple threads once that support is added.

Have you tried forcing @inbounds just in case?

@devmotion (Member, Author) commented:

> Have you tried forcing @inbounds just in case?

It seems there's no need for @inbounds in the mapreduce version. I updated the comment above with the implementation that I tested.

@nalimilan (Member) commented:

Sorry I meant @inline.

@devmotion (Member, Author) commented:

> mapreduce also has the potential of using multiple threads once that support is added.

Yeah, I agree, that would be great. Also, since the partial sums can be combined, it seems it would be trivial to implement logsumexp with (parallel) transducers.

@devmotion (Member, Author) commented:

Unfortunately, the timings are the same even if I add an @inline to the reduction function.

@nalimilan (Member) commented:

OK, too bad. Maybe it would be faster if you passed i and j as in my example, so that you can identify the most common case where one of the arguments is a single value. In that case you would be able to do if x < xmax; r += exp(x - xmax) as in logsumexp_onepass. Indeed, IIRC mapreduce is going to call op(r, f(x[i])) most of the time (as opposed to combining two partial sums, which only happens once per block), so that's where performance matters, and getting rid of one multiplication should make a difference.
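
A hypothetical sketch of that suggestion (illustrative only; each element is mapped to (x, 1.0, 1), and the common single-element case on the right avoids the extra multiplication):

function logsumexp_onepass_reduction_counted((xmax1, r1, n1), (xmax2, r2, n2))
    # (± Inf edge cases omitted for brevity)
    if n2 == 1
        # common case: the right argument is a single freshly mapped element,
        # whose partial sum r2 is exactly 1
        x = xmax2
        if x <= xmax1
            return (xmax1, r1 + exp(x - xmax1), n1 + 1)
        else
            return (x, r1 * exp(xmax1 - x) + 1, n1 + 1)
        end
    elseif xmax1 < xmax2
        # general case: combine two partial sums
        return (xmax2, r2 + r1 * exp(xmax1 - xmax2), n1 + n2)
    else
        return (xmax1, r1 + r2 * exp(xmax2 - xmax1), n1 + n2)
    end
end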

src/basicfuns.jl (outdated diff context):
return xmax + log(r)
end

# with initial element: required by CUDA
@nalimilan (Member) commented:

I guess it's also required by Array when the input is empty, right?

@devmotion (Member, Author) commented:

No, empty inputs are handled by log(sum(X)), as before. Actually, we never want to use this method except for GPU arrays: the type calculations are doomed to fail, e.g., for arrays that are not concretely typed, so it is much safer not to rely on them and instead just compute the output. I'm not sure why the type calculations in https://github.com/JuliaGPU/GPUArrays.jl/blob/356c5fe3f83f76f8139b4edf5536cbd08d48da7f/src/host/mapreduce.jl#L49-L52 fail; I can check if this can be fixed from our side. Alternatively, maybe another package such as NNlib should provide a GPU-compatible overload of StatsFuns.logsumexp?

@nalimilan (Member) commented:

OK. They fail because neutral_element is only defined for a few particular functions. Another approach would be to use the type of the first element, since we already handle the case where the array is empty. That way we wouldn't need two different methods depending on whether the eltype is known.

Given that it seems doable, it would be nice to have something that works for all arrays without another package having to overload the method.

@devmotion (Member, Author) commented Sep 5, 2020

IMO it is just bad that GPUArrays requires us to provide an initial or neutral element. We really never want to do that, regardless of whether there is a first element. Retrieving the first element is bad for any stateful iterator, and it is a scalar indexing operation on the GPU (which leads to warnings when calling the function, and we cannot remove them without depending on CUDA). Additionally, the docstring of mapreduce says that

> It is unspecified whether init is used for non-empty collections.

so there is really no point in providing init here, apart from GPU compatibility. The problem is that we cannot restrict our implementation with the init hack to GPU arrays only without taking a dependency on CUDA. Hence I don't think it is completely unreasonable to provide a GPU-compatible implementation somewhere else. An alternative would be to not use the one-pass algorithm as the default algorithm (for arrays).

@nalimilan (Member) commented:

Retrieving the first element should be OK for any AbstractArray (contrary to some iterables). It's too bad that CUDA prints warnings in that case.

Let's keep what you have now: since the method only accepts arrays with T<:Real element type, float(T) will (almost) certainly work.
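
For illustration, the kind of init-based method being discussed might look like this (a sketch, not the exact PR code; logsumexp_onepass_reduction is the combining function shown earlier):

function logsumexp_onepass_init(X::AbstractArray{T}) where {T<:Real}
    FT = float(T)
    xmax, r = mapreduce(x -> (FT(x), one(FT)), logsumexp_onepass_reduction, X;
                        init = (FT(-Inf), zero(FT)))
    return xmax + log(r)
end

The neutral state (FT(-Inf), zero(FT)) combines with any (x, 1) to give (x, 1), so providing init changes nothing mathematically and exists purely for GPU compatibility.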

@nalimilan linked an issue (Sep 5, 2020) that may be closed by this pull request.

@nalimilan (Member) left a comment:

Just a few remaining details.

devmotion and others added 3 commits (September 6, 2020, 12:32)
@nalimilan (Member) left a comment:

Cool. Looks good if you confirm the latest version is as fast as the previous one. I have two stylistic comments.

@devmotion (Member, Author) commented:

The same benchmark as above now gives

julia> @btime StatsFuns.logsumexp($X)
  37.203 ms (0 allocations: 0 bytes)
2570.6633157197443

on my computer.

@devmotion (Member, Author) commented:

For the calls with dims I ran

julia> using StatsFuns, BenchmarkTools, Random

julia> Random.seed!(100);

julia> n = 100_000;

julia> X = 500 .* randn(n, 100);

and got on this PR

julia> @btime StatsFuns.logsumexp($X; dims=2);
  99.331 ms (8 allocations: 2.29 MiB)

julia> @btime StatsFuns.logsumexp($X; dims=1);
  63.163 ms (2 allocations: 2.64 KiB)

julia> @btime StatsFuns.logsumexp($X; dims=(1,2));
  66.100 ms (6 allocations: 352 bytes)

whereas master yields

julia> @btime StatsFuns.logsumexp($X; dims=2);
  117.644 ms (16 allocations: 78.58 MiB)

julia> @btime StatsFuns.logsumexp($X; dims=1);
  86.431 ms (5 allocations: 76.30 MiB)

julia> @btime StatsFuns.logsumexp($X; dims=(1,2));
  72.502 ms (13 allocations: 76.29 MiB)

@nalimilan merged commit 59989e1 into JuliaStats:master (Sep 23, 2020)
@nalimilan (Member) commented:

Thanks @devmotion!

Successfully merging this pull request may close these issues: logsumexp for streaming data; logsumexp for Tuples.