
Add onepass algorithm for logsumexp #97

Merged: 14 commits into JuliaStats:master, Sep 23, 2020

Conversation

@devmotion (Member) commented Jun 8, 2020

This PR fixes #91.

I recently used the streaming algorithm for computing logsumexp described in http://www.nowozin.net/sebastian/blog/streaming-log-sum-exp-computation.html in one of my projects, and I noticed that there is already an open issue about adding it to StatsFuns.
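
For context, here is a minimal sketch of that streaming algorithm (the function name is hypothetical, and the PR's actual implementation differs in details such as type handling):

function streaming_logsumexp(X)  # hypothetical name, for illustration only
    xmax = -Inf  # running maximum of the elements seen so far
    r = 0.0      # running sum of exp(x - xmax)
    for x in X
        if x <= xmax
            r += exp(x - xmax)
        else
            # new maximum: rescale the accumulated sum before absorbing x
            r = r * exp(xmax - x) + 1
            xmax = x
        end
    end
    return xmax + log(r)
end

Each element costs a single exp, and shifting by the running maximum avoids overflow.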

I reran the benchmark mentioned in the issue and got:

julia> using StatsFuns, BenchmarkTools, Random

julia> Random.seed!(100);

julia> n = 10_000_000;

julia> X = 500 .* randn(n);

julia> @btime logsumexp($X)
  94.999 ms (0 allocations: 0 bytes)
2697.6523838563817

julia> @btime StatsFuns.logsumexp_onepass($X)
  32.815 ms (0 allocations: 0 bytes)
2697.6523838563817

julia> f(X) = reduce(logaddexp, X)
f (generic function with 1 method)

julia> @btime f($X)
  162.519 ms (0 allocations: 0 bytes)
2697.6523838563817

Currently logsumexp_onepass is used as the fallback for logsumexp but is not exported. I'm not sure if it should be exported as well. Additionally, maybe it would be beneficial to use the one-pass algorithm even for arrays above a certain size, since it seems to perform better in the benchmark above. Some additional benchmarks suggest that this might be the case even for fairly small arrays:

julia> n = 10;

julia> Random.seed!(100);

julia> X = 500 .* randn(n);

julia> @btime logsumexp($X)
  84.612 ns (0 allocations: 0 bytes)
1016.2819596629249

julia> @btime StatsFuns.logsumexp_onepass($X)
  87.535 ns (0 allocations: 0 bytes)
1016.2819596629249

julia> @btime f($X)
  184.664 ns (0 allocations: 0 bytes)
1016.2819596629249

julia> n = 100;

julia> Random.seed!(100);

julia> X = 500 .* randn(n);

julia> @btime logsumexp($X)
  633.000 ns (0 allocations: 0 bytes)
1786.3178411042175

julia> @btime StatsFuns.logsumexp_onepass($X)
  654.025 ns (0 allocations: 0 bytes)
1786.3178411042175

julia> @btime f($X)
  1.337 μs (0 allocations: 0 bytes)
1786.3178411042175

julia> n = 1_000;

julia> Random.seed!(100);

julia> X = 500 .* randn(n);

julia> @btime logsumexp($X)
  6.698 μs (0 allocations: 0 bytes)
1786.3178411042175

julia> @btime StatsFuns.logsumexp_onepass($X)
  3.854 μs (0 allocations: 0 bytes)
1786.3178411042175

julia> @btime f($X)
  11.460 μs (0 allocations: 0 bytes)
1786.3178411042175

Edit: This PR also fixes the example in #63:

julia> logsumexp(i for i in range(-500, stop = 10, length = 1000) if true) - logsumexp(range(-500, stop = 10, length = 1000))
0.0

Additionally, it addresses (and fixes?) #75 (but maybe we would still want the two-pass algorithm instead?).

@ararslan (Member) commented Jun 8, 2020

Seems like we could just use this algorithm for logsumexp instead of what's currently there without needing a separate function for it. The benchmarks you posted seem to suggest that would be beneficial; any reason not to do that?

@devmotion (Member, Author) commented Jun 8, 2020

The algorithm doesn't handle reductions along specific dimensions, so I guess it would only replace the dims = : case in a straightforward way. Additionally, for n = 100 elements the two-pass algorithm is still slightly faster (the one-pass algorithm seems to take over somewhere between 100 and 1000 elements in my benchmarks).
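
For comparison, the two-pass approach looks roughly like this (a sketch, not the exact StatsFuns code):

function twopass_logsumexp(X)
    xmax = maximum(X)  # pass 1: find the maximum
    # pass 2: sum the exponentials shifted by the maximum
    return xmax + log(sum(x -> exp(x - xmax), X))
end

Since the maximum must be known before the second pass starts, supporting dims reduces to a maximum(X; dims=dims) call followed by a shifted reduction, which the one-pass accumulator does not map onto as directly.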

@ararslan requested a review from @simonbyrne (June 8, 2020, 19:59)
@devmotion (Member, Author) commented:

Bump :)

@dmbates (Contributor) left a comment:

Looks good to me.

One thing I did notice in the checks, not specific to this PR, is that the package is only being tested on Julia 1.0, 1.3, and nightly. Perhaps the Travis and AppVeyor configurations should be updated.

@devmotion (Member, Author) commented:

I addressed this issue in a separate PR.

@dmbates (Contributor) commented Aug 19, 2020

> I addressed this issue in a separate PR.

Thank you.

@nalimilan (Member) commented Sep 1, 2020

AFAICT your algorithm is only very slightly slower for small arrays, and much faster for large arrays, right? So better to use it instead of the existing reduce(logaddexp, X). (We don't usually provide two functions with different algorithms doing the same thing.)

EDIT: I hadn't realized logsumexp_onepass wasn't exported. Then better not to add that function and instead move its code into the body of the logsumexp fallback method.

@nalimilan (Member) commented:

@simonbyrne wrote the previous implementation at #45, so I'd like him to confirm it's OK, in particular for custom arrays like CuArrays, where reduce might have been faster.

I also wonder why reduce(logaddexp, X) is so slow.

@mileslucas mentioned this pull request (Sep 1, 2020)
@devmotion (Member, Author) commented:

> I also wonder why reduce(logaddexp, X) is so slow.

reduce(logaddexp, X) calls log1pexp in every step, which in the example above should hit the log1p(exp(x)) branch in roughly half of the steps. That is surely more expensive than just calling exp(...) in every step (plus one additional log(...) call at the end).
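
For reference, the identities behind those calls look roughly like this (a sketch; StatsFuns' actual definitions use additional branches and carefully chosen cutoffs):

logaddexp(x, y) = max(x, y) + log1pexp(-abs(x - y))
log1pexp(x) = x < 18.0 ? log1p(exp(x)) : x + exp(-x)  # illustrative cutoff

So every reduce step pays for an abs, an exp, and a log1p (or a second exp), whereas the one-pass algorithm pays for a single exp per element.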

@devmotion (Member, Author) commented Sep 3, 2020

> @simonbyrne wrote the previous implementation at #45, so I'd like him to confirm it's OK, in particular for custom arrays like CuArrays, where reduce might have been faster.

Calling logsumexp on a CuArray does not fall back to the reduce version but rather to the version implemented for AbstractArray. In #45, the reduce version was just added as a generic fallback; it seems the main change enabling the use of CuArrays was removing the indexing from the AbstractArray implementation. Hence I'm not convinced that logsumexp_onepass should be used as a replacement for logsumexp(x::AbstractArray; dims = :).

The one-pass algorithm could be used as a replacement for the generic reduce fallback without defining an explicit logsumexp_onepass function, but then it would not be possible to use it with AbstractArrays, which would still dispatch to the existing two-pass method.

Therefore I created a separate logsumexp_onepass function that is used as the generic fallback of logsumexp.

Edit: If it is desired, we could dispatch logsumexp(x::Array; dims = :) (i.e., only for Arrays) to the logsumexp_onepass function but still keep the current implementation of logsumexp for general array types.

Edit 2: Alternatively, one could use the ArrayInterface.fast_scalar_indexing trait to determine whether indexing is fast, and only dispatch to logsumexp_onepass in that case.
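
Both edits might look roughly like this (hypothetical sketches, not code from this PR; _logsumexp_generic is a stand-in for the existing two-pass method, and fast_scalar_indexing is assumed to return a Bool):

# Edit 1: special-case Array only
logsumexp(x::Array{<:Real}) = logsumexp_onepass(x)

# Edit 2: trait-based dispatch via ArrayInterface
using ArrayInterface: fast_scalar_indexing

function logsumexp(x::AbstractArray{<:Real})
    return fast_scalar_indexing(typeof(x)) ? logsumexp_onepass(x) : _logsumexp_generic(x)
end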

@nalimilan (Member) commented:

Unfortunately, ArrayInterface currently depends on Requires, which increases the load time significantly and is therefore probably not OK for this package. We really need this kind of trait in a lightweight package containing only basic definitions. I guess in the meantime it would be OK to use logsumexp_onepass for Array only, and leave a TODO.

Though another more interesting solution would be to adapt the one-pass algorithm to use mapreduce instead. That way, the optimized CuArray methods (which distribute work in parallel over multiple blocks) would be used, and we would get the best of both worlds. AFAICT this should be possible, as the operation applied to each element is associative, right?

@devmotion (Member, Author) commented:

> Though another more interesting solution would be to adapt the one-pass algorithm to use mapreduce instead. That way, the optimized CuArray methods (which distribute work in parallel over multiple blocks) would be used, and we would get the best of both worlds. AFAICT this should be possible, as the operation applied to each element is associative, right?

Hmm, I don't see how this would be possible with mapreduce: the transformation applied to each element depends on the result of the reduction so far, so it is not clear to me how one could define a function of the current element only for the map part of mapreduce. AFAICT it is possible to rewrite the implementation using foldl or foldr. However, every reduction step results in a tuple of the current maximum and the sum of the exponentials of all elements so far (divided by the exponential of the largest element so far), so the reduction is not associative: for foldl the reduction has the signature f(::Tuple{<:Real,<:Real}, ::Real), whereas for foldr it is f(::Real, ::Tuple{<:Real,<:Real}). Hence one can't use reduce.
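
A foldl formulation of the above might look like this (a minimal sketch with the accumulator (running maximum, rescaled sum); not code from this PR):

function logsumexp_foldl(X)
    xmax, r = foldl(X; init = (-Inf, 0.0)) do (xmax, r), x
        if x <= xmax
            (xmax, r + exp(x - xmax))
        else
            # new maximum: rescale the accumulated sum
            (x, r * exp(xmax - x) + 1)
        end
    end
    return xmax + log(r)
end

The accumulator and the elements have different types, which is exactly why plain reduce does not apply.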

@nalimilan (Member) commented:

Yeah, the algorithm currently can only handle one new element at a time. But I wonder whether it would be possible to use a trick in order to keep track of whether the reduction operation is currently working on a single element or on a partial sum of multiple elements. Something like

mapreduce(x -> (x, 1), ((x, i), (y, j)) -> (i == 1 && j == 1 ? ... : (i == 1 ? ... : (j == 1 ? ... : ...)), i + j), v)

AFAICT that would work if we have a formula to combine two partial sums into a single one. Is that the case? :-)

If that doesn't work, let's go with the special-casing of Array.
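
For reference: if each partial state is (m, r) with r equal to the sum of exp(x - m) over the elements it covers, two states do combine into one as m = max(m1, m2) and r = r1 * exp(m1 - m) + r2 * exp(m2 - m), and this operation is associative.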

@devmotion (Member, Author) commented:

Yes, I guess that could work 👍 The reduction of two partial sums seems to be very similar to the currently implemented version AFAICT. I'll see if it works and benchmark both variants.

@devmotion (Member, Author) commented Sep 4, 2020

Hmm, it works mathematically, but it seems mapreduce is quite slow:

julia> using StatsFuns, BenchmarkTools, Random

julia> Random.seed!(100);

julia> n = 10_000_000;

julia> X = 500 .* randn(n);

julia> @btime logsumexp($X)
  106.007 ms (0 allocations: 0 bytes)
2570.6633157197443

julia> @btime StatsFuns.logsumexp_onepass($X)
  42.388 ms (0 allocations: 0 bytes)
2570.6633157197443

julia> @btime StatsFuns.logsumexp_onepass_reduce($X)
  74.562 ms (0 allocations: 0 bytes)
2570.6633157197443

julia> f(X) = reduce(logaddexp, X)
f (generic function with 1 method)

julia> @btime f($X)
  163.131 ms (0 allocations: 0 bytes)
2570.6633157197443

The implementation should be quite optimized and is very similar to that of logsumexp_onepass:

function logsumexp_onepass_reduction((xmax1, r1)::T, (xmax2, r2)::T) where {T<:Tuple}
    if xmax1 < xmax2
        xmax = xmax2
        r = r2 + r1 * exp(xmax1 - xmax2)
    elseif xmax1 > xmax2
        xmax = xmax1
        r = r1 + r2 * exp(xmax2 - xmax1)
    else # ensure finite values if x = xmax = ± Inf
        xmax = xmax1
        r = r1 + r2
    end

    return xmax, r
end

function logsumexp_onepass_reduce(X)
    isempty(X) && return log(sum(X))

    # map each element x to the partial state (x, 1) and combine
    # partial states with the reduction above
    xmax, r = mapreduce(logsumexp_onepass_reduction, X) do x
        x, float(one(x))
    end

    return xmax + log(r)
end

@nalimilan (Member) commented:

Ah, too bad. Though it's a bit faster than the current implementation. I wonder what would be the result for CuArray. mapreduce also has the potential of using multiple threads once that support is added.

Have you tried forcing @inbounds just in case?

@devmotion (Member, Author) commented:

> Have you tried forcing @inbounds just in case?

It seems there's no need for @inbounds in the mapreduce version. I updated the comment above with the implementation that I tested.

@nalimilan (Member) commented:

Sorry I meant @inline.

@devmotion (Member, Author) commented:

> mapreduce also has the potential of using multiple threads once that support is added.

Yeah, I agree, that would be great. Also, since the partial sums can be combined, it seems it would be trivial to implement logsumexp with (parallel) transducers.

@devmotion (Member, Author) commented:

Unfortunately, the timings are the same even if I add an @inline to the reduction function.

@nalimilan (Member) commented:

OK, too bad. Maybe it would be faster if you passed i and j as in my example, so that you can identify the most common case where one of the arguments is a single value. In that case you would be able to do if x < xmax; r += exp(x - xmax) as in logsumexp_onepass. Indeed, IIRC mapreduce is going to call op(r, f(x[i])) most of the time (as opposed to combining two partial sums, which only happens once per block), so that's where performance matters, and getting rid of one multiplication should make a difference.
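
A hypothetical sketch of that suggestion (illustrative only; each element is mapped to (x, 1.0, 1), and the common single-element case on the right avoids the extra multiplication):

function logsumexp_onepass_reduction_counted((xmax1, r1, n1), (xmax2, r2, n2))
    # (± Inf edge cases omitted for brevity)
    if n2 == 1
        # common case: the right argument is a single freshly mapped element,
        # whose partial sum r2 is exactly 1
        x = xmax2
        if x <= xmax1
            return (xmax1, r1 + exp(x - xmax1), n1 + 1)
        else
            return (x, r1 * exp(xmax1 - x) + 1, n1 + 1)
        end
    elseif xmax1 < xmax2
        # general case: combine two partial sums
        return (xmax2, r2 + r1 * exp(xmax1 - xmax2), n1 + n2)
    else
        return (xmax1, r1 + r2 * exp(xmax2 - xmax1), n1 + n2)
    end
end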

src/basicfuns.jl (outdated diff context):
return xmax + log(r)
end

# with initial element: required by CUDA
@nalimilan (Member) commented:

I guess it's also required by Array when the input is empty, right?

@devmotion (Member, Author) commented:

No, empty inputs are handled by log(sum(X)), as before. Actually, we never want to use this method except for GPU arrays: the type calculations are doomed to fail, e.g., for arrays that are not concretely typed, so it is much safer not to rely on them and instead just compute the output. I'm not sure why the type calculations in https://github.com/JuliaGPU/GPUArrays.jl/blob/356c5fe3f83f76f8139b4edf5536cbd08d48da7f/src/host/mapreduce.jl#L49-L52 fail; I can check if this can be fixed from our side. Alternatively, maybe another package such as NNlib should provide a GPU-compatible overload of StatsFuns.logsumexp?

@nalimilan (Member) commented:

OK. They fail because neutral_element is only defined for a few particular functions. Another approach would be to use the type of the first element, since we already handle the case where the array is empty. That way we wouldn't need two different methods depending on whether the eltype is known.

Given that it seems doable, it would be nice to have something that works for all arrays without another package having to overload the method.

@devmotion (Member, Author) commented Sep 5, 2020

IMO it is just bad that GPUArrays requires us to provide an initial or neutral element. We really never want to do that, regardless of whether there is a first element. Retrieving the first element is bad for any stateful iterator, and it is a scalar indexing operation on the GPU (which leads to warnings when calling the function, and we cannot remove them without depending on CUDA). Additionally, the docstring of mapreduce says that

> It is unspecified whether init is used for non-empty collections.

so there is really no point in providing init here, apart from GPU compatibility. The problem is that we cannot restrict our implementation with the init hack to GPU arrays only without taking a dependency on CUDA. Hence I don't think it is completely unreasonable to provide a GPU-compatible implementation somewhere else. An alternative would be to not use the one-pass algorithm as the default algorithm (for arrays).

@nalimilan (Member) commented:

Retrieving the first element should be OK for any AbstractArray (contrary to some iterables). It's too bad that CUDA prints warnings in that case.

Let's keep what you have now: since the method only accepts arrays with T<:Real element type, float(T) will (almost) certainly work.
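
For illustration, the kind of init-based method being discussed might look like this (a sketch, not the exact PR code; logsumexp_onepass_reduction is the combining function shown earlier):

function logsumexp_onepass_init(X::AbstractArray{T}) where {T<:Real}
    FT = float(T)
    xmax, r = mapreduce(x -> (FT(x), one(FT)), logsumexp_onepass_reduction, X;
                        init = (FT(-Inf), zero(FT)))
    return xmax + log(r)
end

The neutral state (FT(-Inf), zero(FT)) combines with any (x, 1) to give (x, 1), so providing init changes nothing mathematically and exists purely for GPU compatibility.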

@nalimilan linked an issue (Sep 5, 2020) that may be closed by this pull request.

@nalimilan (Member) left a comment:

Just a few remaining details.

devmotion and others added 3 commits (September 6, 2020, 12:32)
@nalimilan (Member) left a comment:

Cool. Looks good if you confirm the latest version is as fast as the previous one. I have two stylistic comments.

@devmotion (Member, Author) commented:

The same benchmark as above now gives

julia> @btime StatsFuns.logsumexp($X)
  37.203 ms (0 allocations: 0 bytes)
2570.6633157197443

on my computer.

@devmotion (Member, Author) commented:

For the calls with dims I ran

julia> using StatsFuns, BenchmarkTools, Random

julia> Random.seed!(100);

julia> n = 100_000;

julia> X = 500 .* randn(n, 100);

and got on this PR

julia> @btime StatsFuns.logsumexp($X; dims=2);
  99.331 ms (8 allocations: 2.29 MiB)

julia> @btime StatsFuns.logsumexp($X; dims=1);
  63.163 ms (2 allocations: 2.64 KiB)

julia> @btime StatsFuns.logsumexp($X; dims=(1,2));
  66.100 ms (6 allocations: 352 bytes)

whereas master yields

julia> @btime StatsFuns.logsumexp($X; dims=2);
  117.644 ms (16 allocations: 78.58 MiB)

julia> @btime StatsFuns.logsumexp($X; dims=1);
  86.431 ms (5 allocations: 76.30 MiB)

julia> @btime StatsFuns.logsumexp($X; dims=(1,2));
  72.502 ms (13 allocations: 76.29 MiB)

@nalimilan merged commit 59989e1 into JuliaStats:master (Sep 23, 2020)
@nalimilan (Member) commented:

Thanks @devmotion!

Successfully merging this pull request may close these issues: logsumexp for streaming data; logsumexp for Tuples.