Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: make indexing and eachcol return views of data frame columns #1856

Closed
wants to merge 5 commits into from

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Jun 23, 2019

This fixes #1844.

I open this in case there are any design comments. This is still WIP as I have to do two things:

  • review documentation in detail to make sure it is updated (I will also make other changes to documentation as I read it if it is outdated in this PR to simplify my life)
  • go though DataFrames.jl internals to make sure we do not rely on df[:col] returning the actual stored vector not its view

@bkamins
Copy link
Member Author

bkamins commented Jun 23, 2019

I already see one slight issue with this PR. Earlier when we written df[col] we could distinguish type of df by the return value (at least in "normal" cases). Now it is not possible as df[col] always returns a SubArray no matter if df is a DataFrame or SubDataFrame. (see the required change to hcat! in this PR for the consequences)

@nalimilan
Copy link
Member

I already see one slight issue with this PR. Earlier when we written df[col] we could distinguish type of df by the return value (at least in "normal" cases). Now it is not possible as df[col] always returns a SubArray no matter if df is a DataFrame or SubDataFrame. (see the required change to hcat! in this PR for the consequences)

Indeed. Though AFAICT that's not a big problem, right? We could add an argument to getcol to allow extracting the original vector rather than a view (for types where it makes sense; I'm thinking of DataFrame but also GroupedDataFrame soon probably).

@bkamins
Copy link
Member Author

bkamins commented Jun 25, 2019

I was thinking over this and I think it is mostly problem for me (as I do internal development). For people having a view always is actually more consistent.

Now you have raised a getcol function issue. We discussed earlier that it is not really needed. What do you think would be its use case.

@bkamins
Copy link
Member Author

bkamins commented Jun 25, 2019

Regarding getcol. Sorry for my mistake.
What I wanted to ask is the following:

what kind of behaviors (so essentially keyword arguments) we want in getcol

assuming getcol(df, col) replaces df[col] and will return a view.

@nalimilan
Copy link
Member

It could make sense to have a keyword argument to getcol to request "the simplest type which can represent a column": i.e. Vector if possible (for DataFrame, or later GroupedDataFrame), SubVector otherwise. But I don't know how to call that argument. Anyway we can defer this.

@bkamins
Copy link
Member Author

bkamins commented Jun 25, 2019

@nalimilan We have revenge of the CategoricalArray issue again. If we always return views then the problem is that a view of CategoricalArray is not AbstractCategoricalArray. This might break a lot of code that relies on checking if some column in a data frame is categorical. What do you think about this case? A direct approach is to have a special type for view of CategoricalArray defined in Categorical.jl.

@nalimilan
Copy link
Member

I think people should just switch to AbstractArray{<:Union{CategoricalString,CategoricalValue}}, as they should support views anyway.

@bkamins
Copy link
Member Author

bkamins commented Jun 25, 2019

@oxinabox It turns out that returning a view by default from df.x does not play well with CategoricalArrays.jl. The problem is that such a view is not a subtype of AbstractCategoricalArray. Unfortunately we cannot make it to be such a subtype because it is already a subtype of SubArray.

The consequence is that introducing such a change would have to be synchronized with changes in all packages that use CategoricalArrays.jl and DataFrames.jl to make sure that they do not break (and they probably would as up till now the normal way to check if something is a categorical variable was checking its type against AbstractCategoricalArray or even more narrowly CategoricalArray)..

@nalimilan
Copy link
Member

So in other words it's not completely clear that the involved work and increased complexity (for us, but also for users) are worth the limited increase in safety.

@quinnj
Copy link
Member

quinnj commented Jun 25, 2019

Personally, I think we'll run into a decent amount of cases where users will be surprised by this, not unlike the switch in CSV.read to return read-only Column by default.

@oxinabox
Copy link
Contributor

I think it is worth it,
but I think the way to do it is the return a custom column type.
This means we could have out own NoResizeCategoricalVector <: AbstractCategoricalArray.

@bkamins
Copy link
Member Author

bkamins commented Jul 12, 2019

The conclusion from the discussions and the implementation in #1866 is that we will return vectors not their views so I am closing this PR.

@bkamins bkamins closed this Jul 12, 2019
@bkamins bkamins deleted the getindex_col_view branch July 15, 2019 14:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make getproperty(df, col) return a full length view of the column
4 participants