WIP: make indexing and eachcol return views of data frame columns #1856

bkamins · 2019-06-23T22:38:21Z

This fixes #1844.

I open this in case there are any design comments. This is still WIP as I have two things left to do:

- review the documentation in detail to make sure it is updated (in this PR I will also fix anything else in the documentation that turns out to be outdated as I read it, to simplify my life)
- go through DataFrames.jl internals to make sure we do not rely on df[:col] returning the actual stored vector rather than its view

bkamins · 2019-06-23T23:00:16Z

I already see one slight issue with this PR. Earlier, when we wrote df[col], we could distinguish the type of df by the return value (at least in "normal" cases). Now this is not possible, as df[col] always returns a SubArray no matter whether df is a DataFrame or a SubDataFrame (see the required change to hcat! in this PR for the consequences).
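To make the point above concrete, here is a minimal sketch in Base Julia only. It does not use DataFrames.jl; `col` is an assumed stand-in for a vector stored in a data frame, and `view(col, :)` stands in for what df[col] would return under this PR.

```julia
# Base Julia only. `col` is a stand-in for a vector stored inside a DataFrame.
col = [10, 20, 30]
v = view(col, :)                    # full-length view of the stored vector

is_subarray = v isa SubArray        # true for any such view
is_vector   = v isa Vector          # false: the caller no longer gets a Vector
same_parent = parent(v) === col     # true: the stored vector is still reachable

# Writing through the view changes the stored vector.
v[1] = 99
wrote_through = col[1] == 99        # true

# Resizing through the view is not possible; the call throws.
resized = try
    push!(v, 40)
    true
catch
    false
end

println((is_subarray, is_vector, same_parent, wrote_through, resized))
```

The view still supports element assignment, and `parent` recovers the underlying vector, but code that branches on the returned value being a Vector would need another way to learn what kind of data frame it came from.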

nalimilan · 2019-06-25T08:15:31Z

> I already see one slight issue with this PR. Earlier, when we wrote df[col], we could distinguish the type of df by the return value (at least in "normal" cases). Now this is not possible, as df[col] always returns a SubArray no matter whether df is a DataFrame or a SubDataFrame (see the required change to hcat! in this PR for the consequences).

Indeed. Though AFAICT that's not a big problem, right? We could add an argument to getcol to allow extracting the original vector rather than a view (for types where it makes sense; I'm thinking of DataFrame but also GroupedDataFrame soon probably).

bkamins · 2019-06-25T09:25:42Z

I was thinking this over, and I think it is mostly a problem for me (as I do internal development). For users, always getting a view is actually more consistent.

Now you have raised the issue of a getcol function. We discussed earlier that it is not really needed. What do you think its use case would be?

bkamins · 2019-06-25T09:47:12Z

Regarding getcol: sorry for my mistake. What I wanted to ask is the following: assuming getcol(df, col) replaces df[col] and returns a view, what kind of behaviors (so essentially keyword arguments) do we want in getcol?

nalimilan · 2019-06-25T10:05:16Z

It could make sense to have a keyword argument to getcol to request "the simplest type which can represent a column": i.e. Vector if possible (for DataFrame, or later GroupedDataFrame), SubVector otherwise. But I don't know what to call that argument. Anyway, we can defer this.

bkamins · 2019-06-25T12:26:02Z

@nalimilan We have the revenge of the CategoricalArray issue again. If we always return views, then the problem is that a view of a CategoricalArray is not an AbstractCategoricalArray. This might break a lot of code that relies on checking whether some column in a data frame is categorical. What do you think about this case? A direct approach is to have a special type for a view of a CategoricalArray defined in CategoricalArrays.jl.

nalimilan · 2019-06-25T15:43:04Z

I think people should just switch to AbstractArray{<:Union{CategoricalString,CategoricalValue}}, as they should support views anyway.
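A minimal sketch of why the container-type check misses views while an element-type check does not. All types below (`AbstractCatVector`, `CatValue`, `CatVector`) are hypothetical stand-ins written in Base Julia; they are not the CategoricalArrays.jl types, and the real element types to match may differ between package versions.

```julia
# Hypothetical stand-ins; these are NOT the CategoricalArrays.jl types.
abstract type AbstractCatVector{T} <: AbstractVector{T} end

struct CatValue                     # plays the role of a categorical element
    level::String
end

struct CatVector <: AbstractCatVector{CatValue}   # plays the role of the array type
    data::Vector{CatValue}
end

# Minimal read-only AbstractArray interface for the stand-in.
Base.size(x::CatVector) = size(x.data)
Base.getindex(x::CatVector, i::Int) = x.data[i]
Base.IndexStyle(::Type{CatVector}) = IndexLinear()

x = CatVector([CatValue("a"), CatValue("b")])
v = view(x, :)                      # a SubArray wrapping x

# Check by container type: a SubArray is not an AbstractCatVector.
iscat_by_type(col) = col isa AbstractCatVector

# Check by element type: works for the array and for any view of it.
iscat_by_eltype(col) = col isa AbstractArray{<:CatValue}

println((iscat_by_type(x), iscat_by_type(v)))      # (true, false)
println((iscat_by_eltype(x), iscat_by_eltype(v)))  # (true, true)
```

The element-type check passes for views because a SubArray keeps the element type of its parent, even though it cannot share the parent's abstract supertype.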

bkamins · 2019-06-25T16:06:13Z

@oxinabox It turns out that returning a view by default from df.x does not play well with CategoricalArrays.jl. The problem is that such a view is not a subtype of AbstractCategoricalArray. Unfortunately we cannot make it such a subtype, because it is already a subtype of SubArray.

The consequence is that introducing such a change would have to be synchronized with changes in all packages that use CategoricalArrays.jl and DataFrames.jl to make sure that they do not break (and they probably would, as until now the normal way to check whether something is a categorical variable has been to check its type against AbstractCategoricalArray, or even more narrowly CategoricalArray).

nalimilan · 2019-06-25T16:26:03Z

So in other words it's not completely clear that the involved work and increased complexity (for us, but also for users) are worth the limited increase in safety.

quinnj · 2019-06-25T17:02:25Z

Personally, I think we'll run into a decent number of cases where users will be surprised by this, not unlike the switch in CSV.read to return a read-only Column by default.

oxinabox · 2019-06-26T11:00:10Z

I think it is worth it,
but I think the way to do it is to return a custom column type.
This means we could have our own NoResizeCategoricalVector <: AbstractCategoricalArray.
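A rough sketch of the custom column type idea, in Base Julia only. `NoResizeVector` and `AbstractCatVector` are hypothetical names used for illustration; nothing here is an existing DataFrames.jl or CategoricalArrays.jl API. Unlike a SubArray, a package-owned wrapper can declare whatever abstract supertype it needs.

```julia
# Hypothetical abstract type standing in for AbstractCategoricalArray.
abstract type AbstractCatVector{T} <: AbstractVector{T} end

# Hypothetical wrapper: shares storage with the parent, defines no resizing methods.
struct NoResizeVector{T} <: AbstractCatVector{T}
    parent::Vector{T}
end

# Forward the AbstractArray interface to the wrapped vector.
Base.size(x::NoResizeVector) = size(x.parent)
Base.getindex(x::NoResizeVector, i::Int) = x.parent[i]
Base.setindex!(x::NoResizeVector, val, i::Int) = (x.parent[i] = val)
Base.IndexStyle(::Type{<:NoResizeVector}) = IndexLinear()

data = ["a", "b", "c"]
col = NoResizeVector(data)

keeps_supertype = col isa AbstractCatVector   # true: type-based checks still pass

col[2] = "z"                                  # element assignment reaches the parent
wrote_through = data[2] == "z"                # true

# No resizing method is defined for the wrapper, so this throws.
resized = try
    push!(col, "d")
    true
catch
    false
end

println((keeps_supertype, wrote_through, resized, length(data)))
```

The cost is that every wrapped column type needs its own forwarding methods, which is part of the added complexity weighed in this thread.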

bkamins · 2019-07-12T19:57:26Z

The conclusion from the discussions and the implementation in #1866 is that we will return vectors, not their views, so I am closing this PR.

bkamins added 2 commits June 24, 2019 00:35

initial implementation of returning views of data frame columns

8558439

fix hcat!

d771f14

bkamins added 2 commits June 24, 2019 08:29

fix typo

a6fe6bd

fix hcat tests

833de0a

further fixes of tests

b52f017

bkamins mentioned this pull request Jun 26, 2019

Make getproperty(df, col) return a full length view of the column #1844

Closed

bkamins closed this Jul 12, 2019

bkamins deleted the getindex_col_view branch July 15, 2019 14:51
