getindex(::Fill, idx) #863

willtebbutt · 2020-12-28T11:27:28Z

As pointed out here by @mcabbott , the following happens:

julia> using FillArrays, Zygote

julia> Zygote.gradient(x -> x[1], Fill(3, 2))[1]
2-element Array{Int64,1}:
 1
 0

The type is innappropriate -- the answer should be

(value=1, axes=nothing)

or at the very least another Fill.

The text was updated successfully, but these errors were encountered:

DhairyaLGandhi · 2020-12-29T12:47:18Z

Hmm, the output seems consistent since you have gradients defined for some elements and zeros for others.

Do we want to say DNE for the remainder of the elements? Either way this would materialize the array iiuc

willtebbutt · 2020-12-29T13:22:04Z

The point is not that the output is inconsistent (I agree that the output is consistent with what you would get if you provided an Array), it's that the optimal thing to do from a performance perspective would be to return (value=1, axes=nothing). The current behaviour happens largely negaetes the benefit of using a Fill if you plan to do AD with it using Zygote and happen to index into it, since the memory (and hence time) requirements are O(length_of_fill), rather than O(1).

(Part) of a solution is to stop getting in the way of Zygote automatically deriving the correct adjoint by restricting this rrule to something less general than AbstractArray.

edit: only part of a solution because IIRC when I tried this I encountered some type stability issues with literal_getproperty and its adjoint that meant that, while the correct NamedTuple adjoint was returned, the performance didn't improve.

mcabbott · 2021-01-01T19:57:38Z

I guess cases like this are the motivation for treating Fill as being one number:

julia> gradient(y -> sum(sqrt, gradient(x -> sum(x.^2), y)[1]), reshape(1:6,2,3))[1]
2×3 Matrix{Float64}:
 0.707107  0.408248  0.316228
 0.5       0.353553  0.288675

The gradient of the inner sum is a Fill, which gets fed into another function which Zygote is differentiating, and not materialising that array would save some allocations, as you say.

But what seems a little tricky is that, if the returned gradient escapes from Zygote, the returning the one-number version seems like it violates the fact that Fill <: AbstractArray. If I type this, I would be pretty surprised to be given back something not == the result from using dense fill instead:

julia> gradient(x -> sum(cumsum(x, dims=2)), Fill(pi,2,7))[1]
2×7 Matrix{Float64}:
 7.0  6.0  5.0  4.0  3.0  2.0  1.0
 7.0  6.0  5.0  4.0  3.0  2.0  1.0

I suspect no-one literally types this, except when making up examples. But there might be other contexts where code outside of Zygote may do this, and expect this optimisation not to change the answer.

Can these two contexts be clearly distinguished somehow, so that Zygote knows when it's OK to "pre-accumulate" the gradient? In the first case, I think the outer Zygote ought to be able to see the constructor which is making the Fill, while in the second it obviously cannot. That isn't visible to the getindex adjoint, but perhaps the eventual use of InplaceThunk will mean that the decision of whether to accumulate into a dense matrix or one number can be made a bit later?

Notice also that my first example uses a ReshapedArray{..., UnitRange} as input. That of course has complexity O(1) and zero real parameters, but Julia falls back to seeing it as an AbstractMatrix, which seems like the desired behaviour, for a fairly common way to make repeatable junk data.

willtebbutt · 2021-01-01T21:04:39Z

But what seems a little tricky is that, if the returned gradient escapes from Zygote, the returning the one-number version seems like it violates the fact that Fill <: AbstractArray. If I type this, I would be pretty surprised to be given back something not == the result from using dense fill instead:

I see where you're coming from, but I think we're stuck with it as the raw output of Zygote for performance reasons if no other reason (same for all structured arrays really).

Can these two contexts be clearly distinguished somehow, so that Zygote knows when it's OK to "pre-accumulate" the gradient? In the first case, I think the outer Zygote ought to be able to see the constructor which is making the Fill, while in the second it obviously cannot. That isn't visible to the getindex adjoint, but perhaps the eventual use of InplaceThunk will mean that the decision of whether to accumulate into a dense matrix or one number can be made a bit later?

I've thought about this a bit before. It should be possible to implement a separate piece of tooling to convert the output appropriately by comparing it with the input. i.e. if you know the input was a Fill and get a gradient that is a NamedTuple{(:value, :axes)}, you know that it can be converted to a Fill of some kind (or something like that).

Notice also that my first example uses a ReshapedArray{..., UnitRange} as input. That of course has complexity O(1) and zero real parameters, but Julia falls back to seeing it as an AbstractMatrix, which seems like the desired behaviour, for a fairly common way to make repeatable junk data.

Sorry, I'm not sure what you mean by this. Could you elaborate?

mcabbott · 2021-01-01T21:46:18Z

Maybe the short version is that I'd like more examples of where Fill gets used. Your argument is that it must be treated as structurally constant, and while I agree there are examples where this is safe and would speed things up (like the one I gave) I'm not sure that this can't lead to very surprising answers. It's a performance optimisation which may have correctness problems.

It should be possible to implement a separate piece of tooling to convert the output appropriately by comparing it with the input. i.e. if you know the input was a Fill and get a gradient that is a NamedTuple{(:value, :axes)}, you know that it can be converted to a Fill of some kind

No, that's not what I meant. The complaint in my second example is that replacing fill with Fill would (if it were made structurally constant) produce very different numerical results.

I'm not sure what you mean by this.

What I'm trying to say is that a decision to treat (say) Diagonal as structurally zero off-diagonal doesn't automatically lead to a decision for all other matrix types with less than O(N^2) storage. My reshaped range is an example where (I think) it would be pretty surprising to try to preserve structure, or to simply fail. Although you could, you could regard the gradient as influencing only the endpoint of the range, and accumulate one number...

Here's another example:

gradient(x -> range(1,x,length=10)[5], 13)  # Need an adjoint for constructor StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}}. Gradient is of type Vector{Float64}
gradient(x -> x[5], range(1,13,length=10))  # ([0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],)
gradient(x -> x[5], collect(range(1,13,length=10)))  # the same

Similar arguments apply here -- when the range is constructed within the scope of Zygote, it would I believe always be safe to make it structurally rigid, i.e. to store only (start=..., stop=..., step=...) for the gradient. (And this would save allocations, etc.) But when it's not, then it doesn't seem obviously right to impose this, and hence give a different answer for the 2nd & 3rd lines.

I think any structural constraint means that x == y does not imply f'(x) == f'(y). That seems a surprising property, and one which I think we should be a little cautious about applying too widely.

willtebbutt · 2021-01-01T23:37:06Z

Maybe the short version is that I'd like more examples of where Fill gets used.

One place that I use Fills extensively is in TemporalGPs.jl to represent time-invariant dynamics. I represent a model's transition dynamics by an AbstractVector x of things, where the nth element of x represents the dynamics transitioning the state from time t-1 to t -- if each element of x is the same, the system is time-invariant. In this case I represent x as a Fill. When x is long and its elements are high-dimensional, as is commonly the case in spatio-temporal settings, the Fill-is-a-struct point of view saves a very large amount of memory and time.

It's a performance optimisation which may have correctness problems.

I think any structural constraint means that x == y does not imply f'(x) == f'(y). That seems a surprising property, and one which I think we should be a little cautious about applying too widely.

I think these points are the crux of where our views differ, and I think it comes down to differing views on Zygotes / ChainRuless semantics. I'm going to try and explain how I think adjoint types ought to be defined, and why I believe it's a good idea to define them that way.

Firstly, my rough (current) personal definition of a structured array type is that an array x of said type can only represent a subset of the arrays that an Array of the same size can represent. e.g. Diagonal can't represent an array with non-zero off-diagonals, and similar statements can be made about Fills and StepRangeLens. Conversely, my definition of unstructured is something that is not structured.

My view is that the correct thing to do is to treat Fill in the same way as any other struct and define things that represent its adjoints as NamedTuples. My understanding of your position is that the correct thing to do for an AbstractArray is to define its adjoint to be another AbstractArray (hopefully I've got that correct).

The first reason I think it's a good idea to treat many structured arrays as structs is that, for me, one basic requirement of a reverse-mode AD system is that a single reverse-pass should have the same asymptotic time-complexity as the forwards-pass, and have memory requirements that are at most proportional to the sum of the memory required to store the output of every operation on the forwards pass (the usual linear memory scaling property of reverse-mode). From my perspective these properties are essential to make reverse-mode useful -- or rather I've yet to hear a convincing argument otherwise.

You can construct a hand-wavy inductive argument that a standard reverse-mode AD system will satisfy property this provided that all of its rrules do. From this I believe it's clear that it's a necessary (but insufficient) condition that the objects used to represent gradients must require no more memory (in some vague asymptotic sense) than the objects whose gradient they represent.

If we accept the above, it follows that the gradients of any structured array which uses asymptotically less memory than an Array of the same size cannot be defined to be an Array and, by extension, that x == y doesn't imply f'(x) == f'(y). This says nothing about whether the correct thing to do is use a NamedTuple or something else, but I believe the NamedTuple / Composite approach is sufficient.

My second reason for wanting to treat structured arrays as being structured is that I actually believe this to be a desirable property -- I would very much like to be able to express that a particular array is always going to be diagonal, and I would like my AD tooling to respect that. If one adopts a structural perspective this can be achieved very naturally by utilising an appropriate structured array. Conversely, it's less clear to me how to achieve this cleanly if all AbstractArrays are treated as being unstructured. This second point is less important to me than the first.

So all of this is to say that while I agree that it is certainly surprising at first to find that x == y doesn't imply f'(x) == f'(y), I believe it to be something we should resign ourselves to in the context of AD.

mcabbott · 2021-05-03T06:31:16Z

for me, one basic requirement of a reverse-mode AD system is that a single reverse-pass should have the same asymptotic time-complexity as the forwards-pass

Perhaps it's useful to back up from AD, and ask simpler questions of FillArrays. I claim that operations which could in principle be O(1) will often end up O(length), and that this is almost the entire point of the type's existence. That is, the main reason to use FillArray <: AbstractArray instead of say Val, is precisely that you not only accept but desire all kinds of correct, but probably slower, fallback routines. If you want to be sure that every function which touches this object only does O(1) things, then you could just pass one number around.

That's not precisely the only reason. The struct also contains the size, which perhaps it's convenient to pass along together. And perhaps in f(g(h(x::Fill))) it's more convenient to write one version of the intermediate g, h functions passing AbstractArray along without looking inside, before hitting a specialised f. I think this is the sort of usage you are describing above. These purposes could also be served by a StrictFill struct which did not define getindex, which would then guarantee you errors if some mistake causes it to go to a generic f. About StrictFill there could be no such paradoxes -- its complexity is predictable, and its gradient is obvious.

The situation with Diagonal is similar: Base makes no guarantee that it won't be iterated like a matrix, O(N^2) when it could be O(N). Even within LinearAlgebra there are many functions which do this, because nobody has got around to optimising them. This fallback behaviour is (I claim) the reason you want Diagonal <: AbstractMatrix in the first place. Using Diagonal instead of diagm gets you the possibility of faster paths, but if you want a guarantee, then you should just use a Vector.

I guess that's a long way of saying that I like the second argument more:

I would very much like to be able to express that a particular array is always going to be diagonal, and I would like my AD tooling to respect that.

More basically, I've argued forever that x::Real should never produce a complex gradient. It is not useful to know that you can descend from the bottom of the parabola by going in an imaginary direction, if the problem only mentioned real numbers. If you can opt-in to complex gradients by providing a complex position (as you opt-in with sqrt(-2+0.0im)), then this is already an example where 1.0 == 1+0.0im does not imply that the gradients match.

Returning to arrays, this preservation does still seem much more natural for say Symmetric, than for Fill. I take the former seriously as a mathematical constraint on your algebra, while the obvious mathematical constant array is called a number. Certainly fill() is just something you occasionally need to write as an implementation detail.

Maybe this is a sharper version of my major concern: If someone has replaced fill() with Fill() as an invisible optimisation somewhere deep in their code, which you now feed to AD, can their change alter the result you get? When the AD treats it as a struct as proposed, I mean. Hand-wavy proofs are accepted.

If someone passes a Fill from outside the AD code, then I'm not super-sure, call that a minor concern. Maybe nobody does that. They do write gradient(f, reshape(1:12, 3, 4)) all the time for docstring examples. Maybe it's OK to break some of those? Or maybe user-facing functions like gradient should collect or warn, so as not to let a weird tangent type escape out into the world? But this is a secondary thing.

willtebbutt · 2021-09-23T17:25:47Z

I think the discussion has generally moved on since I opened this, but it's worth pointing out that quite literally no subtype of AbstractArray will work with the current definition of getindex without adding an adjoint for its constructor because you need to use the structural tangent for that.

Conversely, if we restrict the definition of getindex, things actually stand a chance of working. For example, on master, we get:

using Zygote
using StaticArrays
julia> Zygote.gradient(x -> SVector{2}(x)[1], (1.0, 2.0))
ERROR: Need an adjoint for constructor SVector{2, Float64}. Gradient is of type Zygote.OneElement{Float64, 1, Tuple{Int64}, Tuple{SOneTo{2}}}
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:33
  [2] (::Zygote.Jnew{SVector{2, Float64}, Nothing, false})(Δ::Zygote.OneElement{Float64, 1, Tuple{Int64}, Tuple{SOneTo{2}}})
    @ Zygote ~/.julia/dev/Zygote/src/lib/lib.jl:323
  [3] (::Zygote.var"#1788#back#231"{Zygote.Jnew{SVector{2, Float64}, Nothing, false}})(Δ::Zygote.OneElement{Float64, 1, Tuple{Int64}, Tuple{SOneTo{2}}})
    @ Zygote ~/.julia/packages/ZygoteRules/OjfTt/src/adjoint.jl:59
  [4] Pullback
    @ ~/.julia/packages/StaticArrays/vxjOO/src/SArray.jl:23 [inlined]
  [5] (::typeof(∂(SVector{2, Float64})))(Δ::Zygote.OneElement{Float64, 1, Tuple{Int64}, Tuple{SOneTo{2}}})
    @ Zygote ~/.julia/dev/Zygote/src/compiler/interface2.jl:0
  [6] Pullback
    @ ~/.julia/packages/StaticArrays/vxjOO/src/SVector.jl:20 [inlined]
  [7] Pullback
    @ ./REPL[7]:1 [inlined]
  [8] (::typeof(∂(#3)))(Δ::Float64)
    @ Zygote ~/.julia/dev/Zygote/src/compiler/interface2.jl:0
  [9] (::Zygote.var"#50#51"{typeof(∂(#3))})(Δ::Float64)
    @ Zygote ~/.julia/dev/Zygote/src/compiler/interface.jl:41
 [10] gradient(f::Function, args::Tuple{Float64, Float64})
    @ Zygote ~/.julia/dev/Zygote/src/compiler/interface.jl:76
 [11] top-level scope
    @ REPL[7]:1

vs restricting the applicability of the @adjoint for getindex to something that excludes SArrays (I went with StridedArray on a local branch, but I'm not convinced that's precisely the correct choice):

julia> Zygote.gradient(x -> SVector{2}(x)[1], (1.0, 2.0))
((1.0, nothing),)

which is the correct result.

edit: on a separate note, the current error message is misleading. It wrongly suggests that the need for an adjoint for the constructor is the only possible explanation for the error, whereas it could easily be that a rule needs modification.

willtebbutt · 2021-09-23T17:32:47Z

Moreover, it's really hard to fix and let AD run on however getindex is implemented. In particular, it's not obvious how to write an @adjoint which says "just run AD on this implementation of getindex, ignore the rule that Zygote has" without modifying the implementation of getindex for the type in question.

willtebbutt mentioned this issue Dec 28, 2020

WIP: cache-less contexts #864

Closed

This was referenced May 4, 2021

Improvements to rules for norm JuliaDiff/ChainRules.jl#337

Merged

Explain the Abstract Primals Problem JuliaDiff/ChainRulesCore.jl#343

Open

mcabbott added the discussion label Jul 22, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

getindex(::Fill, idx) #863

getindex(::Fill, idx) #863

willtebbutt commented Dec 28, 2020

DhairyaLGandhi commented Dec 29, 2020

willtebbutt commented Dec 29, 2020 •

edited

Loading

mcabbott commented Jan 1, 2021

willtebbutt commented Jan 1, 2021

mcabbott commented Jan 1, 2021 •

edited

Loading

willtebbutt commented Jan 1, 2021 •

edited

Loading

mcabbott commented May 3, 2021 •

edited

Loading

willtebbutt commented Sep 23, 2021 •

edited

Loading

willtebbutt commented Sep 23, 2021

getindex(::Fill, idx) #863

getindex(::Fill, idx) #863

Comments

willtebbutt commented Dec 28, 2020

DhairyaLGandhi commented Dec 29, 2020

willtebbutt commented Dec 29, 2020 • edited Loading

mcabbott commented Jan 1, 2021

willtebbutt commented Jan 1, 2021

mcabbott commented Jan 1, 2021 • edited Loading

willtebbutt commented Jan 1, 2021 • edited Loading

mcabbott commented May 3, 2021 • edited Loading

willtebbutt commented Sep 23, 2021 • edited Loading

willtebbutt commented Sep 23, 2021

willtebbutt commented Dec 29, 2020 •

edited

Loading

mcabbott commented Jan 1, 2021 •

edited

Loading

willtebbutt commented Jan 1, 2021 •

edited

Loading

mcabbott commented May 3, 2021 •

edited

Loading

willtebbutt commented Sep 23, 2021 •

edited

Loading