Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Row and Column Iterator for Matrices #29749

Merged
merged 25 commits into from
Dec 3, 2018
Merged

Row and Column Iterator for Matrices #29749

merged 25 commits into from
Dec 3, 2018

Conversation

arnavs
Copy link
Contributor

@arnavs arnavs commented Oct 21, 2018

Talked in Slack about creating eachrow(m::Matrix) and eachcolumn(m::Matrix) which return an iterator over the rows and columns of M. Here's a stab at it.

@fredrikekre
Copy link
Member

See #14491

@arnavs
Copy link
Contributor Author

arnavs commented Oct 21, 2018

Thanks @fredrikekre, I updated the naming convention to match what was happening in that thread (eachrow and eachcol).

Also, what's going on with CI? Looks like we have two errors in something called sockets for 2 of the builds?

@andyferris
Copy link
Member

Cool, nice, I was hoping we would implement something like this soon!

This one is a little close to my heart. For this functionality, I was hoping we could be slightly more generic - by being able to slice up arrays of any dimensionality by any one (or more) dimension.

Additionally, since arrays naturally support random-access, it's also possible to turn the "iterator"s into a fully-fledged AbstractArrays in their own right. That is, eachrow of a matrix could return a nested vector of vectors. This would let users interact with the result in ways beyond a simple for loop. For example, the presence of the indices makes it easier to match up multiple containers which share indices along certain dimensions. Finally, different array types may have their own implementations of map, reduce and so-on which is quite different to a sequential for loop over all elements (consider for example doing something in parallel with the rows of a DMatrix from DistributedArrays).

Anyway, because I've been interested in this for a little while, I implemented such a nested array dimension splitting in a package called SplitApplyCombine under the functions splitdims and splitdimsview (the latter being lazy). My intention here was always to develop these in a package and hopefully move these to Base through a PR here for Julia 1.x. (It should be noted that I don't particularly like my function names, but they are at least descriptive placeholders for now).

If I may be slightly philosophical - in my view, it seems more productive and flexible to let users put together different "split", "apply" and "combine" strategies like lego bricks, rather than to rely on increasingly complex functions like mapslices to do a split-apply-combine strategy all in one step, which is why I'm advocating for this in general, and excited by this PR in particular.

Anyway, if @arnavs and the community are at all interested, I'd be happy to help expand the functionality here to work on arbitrary dimensions and return a fully-fledged AbstractArray.

4 similar comments
@andyferris
Copy link
Member

Cool, nice, I was hoping we would implement something like this soon!

This one is a little close to my heart. For this functionality, I was hoping we could be slightly more generic - by being able to slice up arrays of any dimensionality by any one (or more) dimension.

Additionally, since arrays naturally support random-access, it's also possible to turn the "iterator"s into a fully-fledged AbstractArrays in their own right. That is, eachrow of a matrix could return a nested vector of vectors. This would let users interact with the result in ways beyond a simple for loop. For example, the presence of the indices makes it easier to match up multiple containers which share indices along certain dimensions. Finally, different array types may have their own implementations of map, reduce and so-on which is quite different to a sequential for loop over all elements (consider for example doing something in parallel with the rows of a DMatrix from DistributedArrays).

Anyway, because I've been interested in this for a little while, I implemented such a nested array dimension splitting in a package called SplitApplyCombine under the functions splitdims and splitdimsview (the latter being lazy). My intention here was always to develop these in a package and hopefully move these to Base through a PR here for Julia 1.x. (It should be noted that I don't particularly like my function names, but they are at least descriptive placeholders for now).

If I may be slightly philosophical - in my view, it seems more productive and flexible to let users put together different "split", "apply" and "combine" strategies like lego bricks, rather than to rely on increasingly complex functions like mapslices to do a split-apply-combine strategy all in one step, which is why I'm advocating for this in general, and excited by this PR in particular.

Anyway, if @arnavs and the community are at all interested, I'd be happy to help expand the functionality here to work on arbitrary dimensions and return a fully-fledged AbstractArray.

@andyferris
Copy link
Member

Cool, nice, I was hoping we would implement something like this soon!

This one is a little close to my heart. For this functionality, I was hoping we could be slightly more generic - by being able to slice up arrays of any dimensionality by any one (or more) dimension.

Additionally, since arrays naturally support random-access, it's also possible to turn the "iterator"s into a fully-fledged AbstractArrays in their own right. That is, eachrow of a matrix could return a nested vector of vectors. This would let users interact with the result in ways beyond a simple for loop. For example, the presence of the indices makes it easier to match up multiple containers which share indices along certain dimensions. Finally, different array types may have their own implementations of map, reduce and so-on which is quite different to a sequential for loop over all elements (consider for example doing something in parallel with the rows of a DMatrix from DistributedArrays).

Anyway, because I've been interested in this for a little while, I implemented such a nested array dimension splitting in a package called SplitApplyCombine under the functions splitdims and splitdimsview (the latter being lazy). My intention here was always to develop these in a package and hopefully move these to Base through a PR here for Julia 1.x. (It should be noted that I don't particularly like my function names, but they are at least descriptive placeholders for now).

If I may be slightly philosophical - in my view, it seems more productive and flexible to let users put together different "split", "apply" and "combine" strategies like lego bricks, rather than to rely on increasingly complex functions like mapslices to do a split-apply-combine strategy all in one step, which is why I'm advocating for this in general, and excited by this PR in particular.

Anyway, if @arnavs and the community are at all interested, I'd be happy to help expand the functionality here to work on arbitrary dimensions and return a fully-fledged AbstractArray.

@andyferris
Copy link
Member

Cool, nice, I was hoping we would implement something like this soon!

This one is a little close to my heart. For this functionality, I was hoping we could be slightly more generic - by being able to slice up arrays of any dimensionality by any one (or more) dimension.

Additionally, since arrays naturally support random-access, it's also possible to turn the "iterator"s into a fully-fledged AbstractArrays in their own right. That is, eachrow of a matrix could return a nested vector of vectors. This would let users interact with the result in ways beyond a simple for loop. For example, the presence of the indices makes it easier to match up multiple containers which share indices along certain dimensions. Finally, different array types may have their own implementations of map, reduce and so-on which is quite different to a sequential for loop over all elements (consider for example doing something in parallel with the rows of a DMatrix from DistributedArrays).

Anyway, because I've been interested in this for a little while, I implemented such a nested array dimension splitting in a package called SplitApplyCombine under the functions splitdims and splitdimsview (the latter being lazy). My intention here was always to develop these in a package and hopefully move these to Base through a PR here for Julia 1.x. (It should be noted that I don't particularly like my function names, but they are at least descriptive placeholders for now).

If I may be slightly philosophical - in my view, it seems more productive and flexible to let users put together different "split", "apply" and "combine" strategies like lego bricks, rather than to rely on increasingly complex functions like mapslices to do a split-apply-combine strategy all in one step, which is why I'm advocating for this in general, and excited by this PR in particular.

Anyway, if @arnavs and the community are at all interested, I'd be happy to help expand the functionality here to work on arbitrary dimensions and return a fully-fledged AbstractArray.

@andyferris
Copy link
Member

Cool, nice, I was hoping we would implement something like this soon!

This one is a little close to my heart. For this functionality, I was hoping we could be slightly more generic - by being able to slice up arrays of any dimensionality by any one (or more) dimension.

Additionally, since arrays naturally support random-access, it's also possible to turn the "iterator"s into a fully-fledged AbstractArrays in their own right. That is, eachrow of a matrix could return a nested vector of vectors. This would let users interact with the result in ways beyond a simple for loop. For example, the presence of the indices makes it easier to match up multiple containers which share indices along certain dimensions. Finally, different array types may have their own implementations of map, reduce and so-on which is quite different to a sequential for loop over all elements (consider for example doing something in parallel with the rows of a DMatrix from DistributedArrays).

Anyway, because I've been interested in this for a little while, I implemented such a nested array dimension splitting in a package called SplitApplyCombine under the functions splitdims and splitdimsview (the latter being lazy). My intention here was always to develop these in a package and hopefully move these to Base through a PR here for Julia 1.x. (It should be noted that I don't particularly like my function names, but they are at least descriptive placeholders for now).

If I may be slightly philosophical - in my view, it seems more productive and flexible to let users put together different "split", "apply" and "combine" strategies like lego bricks, rather than to rely on increasingly complex functions like mapslices to do a split-apply-combine strategy all in one step, which is why I'm advocating for this in general, and excited by this PR in particular.

Anyway, if @arnavs and the community are at all interested, I'd be happy to help expand the functionality here to work on arbitrary dimensions and return a fully-fledged AbstractArray.

@andyferris
Copy link
Member

(Github had problems earlier today... looks like they posted all my retries!)

@joshday
Copy link

joshday commented Oct 22, 2018

I'll add that I've also implemented eachrow/eachcol in OnlineStatsBase.

The difference from the implementation in this PR is that OnlineStatsBase uses a buffer instead of views (I think views were a little slower, but that could be investigated further). The code is here: https://github.com/joshday/OnlineStatsBase.jl/blob/0449a586f9229f6bf43a2214b11a271318a240a4/src/OnlineStatsBase.jl#L84-L142.

@arnavs
Copy link
Contributor Author

arnavs commented Oct 22, 2018

Replies to feedback (first @andyferris, then @joshday):

  1. Fully agree, would love if the slicing API was n-dimensional.

  2. Additionally, since arrays naturally support random-access, it's also possible to turn the "iterator"s into a fully-fledged AbstractArrays in their own right. That is, eachrow of a matrix could return a nested vector of vectors.

    For sure, we could implement rows(M) = collect(eachrow(M)) or something similar. Is that what you had in mind? (Edit: I still think it's valuable to keep the lightweight view iterator for instances that aren't conducive to reproducing all the data.)

  3. Anyway, because I've been interested in this for a little while, I implemented such a nested array dimension splitting in a package called SplitApplyCombine under the functions splitdims and splitdimsview

    Sounds good, I'll check this out.

  4. Anyway, if @arnavs and the community are at all interested, I'd be happy to help expand the functionality here to work on arbitrary dimensions and return a fully-fledged AbstractArray.

    For sure. Maybe if people think this is a good idea (cc: @ararslan, @StefanKarpinski, @fredrikekre), we can merge this and then implement a method eachslice(A::AbstractArray)? Or, if we need to workshop this a bit, that's OK too.

  5. I'll add that I've also implemented eachrow/eachcol in OnlineStatsBase...I think views were a little slower...

    Quite possible, but that's above my pay grade. I'll let others weigh in.

@StefanKarpinski
Copy link
Sponsor Member

@mbauman, can you take a look at this?

@mbauman
Copy link
Sponsor Member

mbauman commented Oct 22, 2018

I agree with most all of what @andyferris wrote above, except that I'd suggest we start as simple as possible. I'll also note that there's another package implementation in JuliennedArrays.jl.

I'd suggest simply:

eachrow(A::AbstractArray) = (view(A, i, :) for i in axes(A, 1))
eachcol(A::AbstractArray) = (view(A, :, j) for j in axes(A, 2))

For the general case, it looks like constant propagation can almost handle:

eachslice(A, d) = (view(A, ntuple(n->n==d ? i : (:), ndims(A))...) for i in axes(A, d))

but we're not quite there yet. It'll take a bit more to make this type-stable.

Edit: oh, of course, we can just use selectdim:

eachslice(A, d) = (selectdim(A, 

Copy link
Sponsor Member

@mbauman mbauman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd really like to see this as a more general structure, but I've added a few pedagogical comments based on the code you've written that I hope will be helpful.

stdlib/LinearAlgebra/src/generic.jl Outdated Show resolved Hide resolved
stdlib/LinearAlgebra/src/generic.jl Outdated Show resolved Hide resolved
stdlib/LinearAlgebra/src/generic.jl Outdated Show resolved Hide resolved
@mbauman
Copy link
Sponsor Member

mbauman commented Oct 22, 2018

buffer instead of views

I think I prefer the views behavior. Yes, buffers can be faster in many situations, but they also can have strange effects if you, e.g., collect the iterator. I prefer the view behaviors, and there's hope that someday they'll be faster (#14955).

@andyferris
Copy link
Member

we can merge this and then implement a method eachslice(A::AbstractArray)

except that I'd suggest we start as simple as possible

I agree - let's do baby steps! :)

My concern is more on the bikeshedding front - that a function name such as eachrow is great for an iterator (for row in eachrow(matrix) is indeed very readable), but IMO not so great for a general function which transforms an array into another array. If we know the function will be expanded to included more functionality later, we may as well rename it now to reduce churn.

Our typical policy is to pick a simple English verb for Julia functions (and I don't see why we would deviate from that). Ideas that come to mind include slice, split, splitdims.

I'll also note that there's another package implementation in JuliennedArrays.jl.

Yes, I meant to come back and acknowledge JuliennedArrays (sorry @bramtayl). While I am talking function names - while I really appreciate the julienne play on words (genius!), for Julia Base perhaps a simpler English verb might be preferrable for new users (especially for those whose mother tongue isn't English)?

@nalimilan
Copy link
Member

Honestly I don't think adding unexported functions in one release will help people to fix their code in advance. Only people following Julia development closely will be aware of that, and these people are precisely the ones which will adapt before 1.1 is out anyway (like DataFrames). Others will only notice the breakage in 1.2 when we eventually export these symbols.

Regarding DataFrames, let's not take it as an argument to delay the addition of new exports. @bkamins has already made a PR to make the API more consistent with this PR, and we can merge a fix even before this PR is merged in Julia, and tag it in a few days.

@arnavs
Copy link
Contributor Author

arnavs commented Dec 1, 2018

If everyone agrees that ‘eachslice’ is good to export, perhaps we could do that now and continue the discussion on the other two?

Not to put too fine a point on it, but I think we’ve reached the convergence point from this PR (the functionality we want is added, documented, tested, and won’t break things downstream). If I’m wrong, happy to iterate further, but this feature request has been open for years now and it would be nice to close.

@mbauman
Copy link
Sponsor Member

mbauman commented Dec 1, 2018

I would just very much like to find a conclusion here. @arnavs has been very tenacious, persistent and patient for this first PR. I leaned on the side of being conservative to help make that happen but apparently that backfired.

@nalimilan
Copy link
Member

AFAICT we all agree the PR can be merged. The debate is about whether we should add more exports or not. But if we don't reach an agreement soon better merge and continue the discussion after that.

@bkamins
Copy link
Member

bkamins commented Dec 2, 2018

If some package used e.g. eachcol earlier (like DataFrames.jl) the maintainers will have to introduce a conditional definition of eachcol and Base.eachcol anyway to keep supporting Julia 1.0 (or handle this duality via Compat.jl) no matter if we export it in Julia 1.1 or 1.2.

@AzamatB
Copy link
Contributor

AzamatB commented Dec 2, 2018

My 2¢: I agree with @nalimilan and @bkamins and think there is no need to postpone exports to 1.2

@arnavs
Copy link
Contributor Author

arnavs commented Dec 2, 2018

Alright, I re-added the exports. Let's never think about this PR ever again :)

@StefanKarpinski
Copy link
Sponsor Member

@nalimilan or @mbauman, please merge if and when you think this is ready (with appropriate squashing).

@mbauman
Copy link
Sponsor Member

mbauman commented Dec 3, 2018

I'm thrilled DataFrames folks are on board here and will accommodate the breakage. Thanks everyone. Let's merge!

@nalimilan
Copy link
Member

Thanks! DataFrames PR is JuliaData/DataFrames.jl#1614, will be tagged shortly.

simonbyrne added a commit that referenced this pull request Jun 12, 2019
…n `EachSlice` object (along with `EachRow`/`EachCol` aliases). The main benefit is that it will allow dispatch on the iterator to provide more efficient methods, e.g.

```
sum(A::EachRow) = vec(sum(parent(A), dims=1))
```

This will encourage the use of `eachcol`/`eachrow` to resolve ambiguities in user-facing APIs, in particular, the "obsverations as rows vs columns" problem in the statistics/ML packages.

This also makes `eachslice` work over multiple dimensions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:arrays [a, r, r, a, y, s]
Projects
None yet
Development

Successfully merging this pull request may close these issues.