Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider making setproperty copy Vector #1846

Closed
oxinabox opened this issue Jun 10, 2019 · 31 comments
Closed

Consider making setproperty copy Vector #1846

oxinabox opened this issue Jun 10, 2019 · 31 comments

Comments

@oxinabox
Copy link
Contributor

This is infollow up to other threads trying to remove all the ways users can resize columns out of sync and thus break the DataFrame,

#1844 (comment)

There would be still one hole:

df.a = vector

One option, and I am not completely sold on it, but that I feel is worth considering.
Is that we could make setproperty and single argument setindex,
copy the input.

This would be consistent with the behavior of the constructor as copying the Vectors,

But unlike that there is no easy way to pass in a copycols=false kwarg or a Dataframe! constructor.

So if we were to do that, I suggest we would have a @nocopy macro to bypass this.,
so that

@nocopy df.a = vector

would not copy vector but would dispatch to some alternative nocopy_setcol
where as

df.a = vector

would dispatch to setcol (which would intern dispatch to nocopy_setcol(df, col, copy(vector))

@bkamins
Copy link
Member

bkamins commented Jun 10, 2019

Actually we could have setcol and getcol with copycols kwarg and then decide what df.a and df[:a] should expand to. If we added setcol and getcol then probably we should rename haskey to hascol and add colindex as discussed earlier with @nalimilan.

But let us wait a bit to hear what other people think that df.a should be a default behavior.

@oxinabox
Copy link
Contributor Author

We should focus on setcol here, since #1844 is basically on getcols, and we don't have pressing need to make them match (copying input is very different to copying output)

I think it makes sense to make it take a copycols kwarg.
I am not convinced that defaulting that to true is a good idea,
but

Pros of Defaulting copycols on setcol to true

  • it would be consistent with the constructor,
  • safe by default

Pros of Defaulting copycols on setcol to false

  • it would be consistant with setfield on a normal struct
  • faster and less allocating

(Ideally we just want copy on write still, or at least for julia to optimize away any copies of vectors that are going to go out of scope according to escape analysis. But neither of those seems likely any time soon.)

@bkamins
Copy link
Member

bkamins commented Jun 10, 2019

we don't have pressing need to make them match

Not pressing 😄, but consistency in the design is very important, as then users, even if they do not know the documentation in detail, tend to do the right things (but yes - we should have separate issues to discuss it, I just wanted to put here the whole possible target design in one place).

So my intuition is that we should go either for copy by default here and in #1844 or no copy by default (and in the no copy case view could be considered in getcol, if we copy by default then probably view is not necessary in getcol as an option).

@oxinabox
Copy link
Contributor Author

oxinabox commented Jun 10, 2019

we don't have pressing need to make them match
Not pressing 😄, but consistency in the design is very important

To be clear, I agree. Note that I referenced consistency in both my dot points (for the constructors and for normal setfield respectively).
But consistency between set and get copying is not one I feel is particularly intuitive.
I can see an intuition that would expect them to be antisymetric.
This would come from an incorrect assumption that
"if it was copied on the way in, why would it be copied on the way out?".
They are rather different operations. to me. Even though it boils down to copying the right hand side,
if one thinks of it as a property of the thing being copied then it doesn't hold.
in x = df.y then it is the dataframe column being copied, not the vector.
whereas in df.y = x it is the vector being copied, not dataframe column.

I don't think it would be too unnatural to copy on the way in for purposes of matching how the constructor, and hcat and join, work.
Following a consistent logic that all added columns are always copies.

And then having some other behavior for getcol because efficiency matters, and columns are read more often than they are added; and so the tradeoff of being a bit more consitent on the copying-ness of getcol and setcol should match (in what amounts to an implementation detail if we have actually plugged all the holes some way or another), is not worth it.

Another scattered thought:

I do not like the idea that df2.x = df1.x would do two copies.
I would like to avoid that world.

@nalimilan
Copy link
Member

Having df.col = vector make a copy indeed sounds consistent with what we now do with constructors. But that would mean we need a way to avoid making a copy, either using @nocopy df.col = vector, @nocopy df[col] = vector and/or setcol!(df, vector, copycols=false).

As a data point, do you know how easy/clean it would be to implement @nocopy? If we went that route, maybe it could be used elsewhere too.

@oxinabox
Copy link
Contributor Author

how easy/clean it would be to implement @NoCopy? If we went that route, maybe it could be used elsewhere too.

There are many ways to do this (TensorFlow.jl does a similar tranform using traits to indicate what ops should change, but for simplicty here is one that just lists them all

The simple version is simple, makng it recursive and making it handle various different ways to express the same call is a bit more fiddly

using MacroTools

nocopy_setproperty(args...) = setproperty(args...)
nocopy_setproperty(df::AbstractDataFrame, args...) = setcol(D)

const no_copy_ops = (DataFrame, setproperty, ...)

macro nocopy(expr)
    expr isa Expr && expr.head ==:call && expr.args[1] in no_copy_ops
        expr.args[1] = subexpr.args[1] = getfield(DataFrames, Symbol(:nocopy_subexpr.args[1]))
    end
    return expr
end

macro recurisive_nocopy(expr)
    MacroTools.postwalk(expr) do subexpr
        if subexpr isa Expr && subexpr.head==:call
            if subexpr.args[1] in no_copy_ops
                # handle foo(x)
                subexpr.args[1] = getfield(DataFrames, Symbol(:nocopy_subexpr.args[1]))
            elseif subexpr.args[1] isa Expr 
                subsubexp = subexpr.args[1]
                # handle DataFrames.foo(x)
                if && subexpr.head == :. && subsubexpr.args[2] in no_copy_ops
                    subsubexpr.args[2] = getfield(DataFrames, Symbol(:nocopy_subexpr.args[1]))
                end
            end
            #TODO: handle DataFrames.Base.foo() etc this should be via recursion while there are :. s
        end
        return subexpr
    end
end

@bkamins
Copy link
Member

bkamins commented Jun 11, 2019

If we decided to define setcol! and require @nocopy to be directly in front of the assignment
then the macro can be relatively simple I think.
It would just rewrite df.col = some_expression and df[col] = some_expression as setcol!(df, col, some_expression, copycols=false).

@bkamins
Copy link
Member

bkamins commented Jun 11, 2019

In general the question is if we need this @nocopy macro and if setcol! would not be enough. I am not sure what would be the best API here.

@quinnj
Copy link
Member

quinnj commented Jun 11, 2019

Here's a slightly more radical idea for the more general issue of preventing non-DataFrames column mutations (which can lead to undesirable column states): the jl_array_t source type (basically the C-equivalent of Array) has a "shared array" bit, which prevents array resizing when flipped on (as far as I know this is only set by mmapped arrays and SharedArrays). We could potentially convince @JeffBezanson and/or @vtjnash to export C-side getter/setter for setting this "shared" bit on Arrays. Then DataFrames could set the bit when ingesting, and unset->mutate->set on mutating operations. That would mean if users still had references to the original vectors, doing resizing would fail (of course unless they do their own unset->mutate operation, which hopefully would be used appropriately).

Now, this isn't quite a silver bullet because we still have the myriad custom array types, but perhaps this could become a formal API of AbstractArray types; in fact, now that I think about it, @Keno threw out some ideas already in JuliaLang/julia#31630 around a freeze/melt API formally, but geared more towards potential compiler optimizations.

@vtjnash
Copy link
Contributor

vtjnash commented Jun 11, 2019

There's no end to the set of things that users could do to make your application state inconsistent. At best you can just make it more tedious. We track that bit to avoid memory corruption (so we can at least given decent errors as it falls apart). In the future, I hope it'll get subsumed by the Buffer type.

But I suspect that requiring magic code sequences like unset->mutate->set or freeze/melt will just lead to more awkward source code on the whole for little or no benefit. It seems a bit like the const annotation in C(++): seems great on paper, but doesn't actually mean what you want, gets in the way much of time, and can't actually enable the optimizations you usually wanted.

@NoCopy
Copy link

NoCopy commented Jun 12, 2019

I guess I was called out by accident. No worries.

Looks like you have a good convo rolling and know your shit. Respect my friends.

@bkamins
Copy link
Member

bkamins commented Jun 13, 2019

So what is the conclusion here?

  • should df.col = vector make a copy (I understand yes)
  • should we define setcol! (I am not sure - it depends on what we decide with @nocopy as we should provide an operation allowing to add a vector to a data frame without copying)
  • should we define @nocopy (I am not very convinced - this is a nice idea, but it might be too magic to normal users - but maybe I am wrong)

@oxinabox
Copy link
Contributor Author

oxinabox commented Jun 19, 2019

  • should df.col = vector make a copy

Yes.

should we define setcol!

Yes, which takes a copycols=true kwarg.

should we define @nocopy

Maybe, but the way it should work if we do is to convert calls to setproperty and setindex into setcol!(...; copycols=false).
So we do still need the setcol!

@bkamins
Copy link
Member

bkamins commented Jun 19, 2019

I will introduce it with setindex! PR which is next in the queue after #1847 and #1840 are merged.
Also then #1844 can be implemented in parallel to setindex! PR.

@bkamins
Copy link
Member

bkamins commented Jun 21, 2019

I will soon start implementing the setindex! PR. Given our discussion I will implement the following:

  • add exported setcol! function with copycols kwarg
  • make setindex! and setproperty copy the vector on assignment

The latter WILL NOT produce a deprecation warning as it would lead to massive deprecation warnings in actual codes.

If there any objections to this plan please comment.

@bkamins
Copy link
Member

bkamins commented Jun 22, 2019

@nalimilan - are you OK with this? (given #1840 is essentially finished this is the next thing I will work on)

@bkamins
Copy link
Member

bkamins commented Jun 23, 2019

#1840 is merged. Anyone interested please comment on this issue as this is going to be a part of the last big PR before the next release.

@nalimilan
Copy link
Member

I'm fine with that. getcol and setcol will finally allow us to deprecate df[col], making the design fully consistent. FWIW, JuliaDB also uses (a non-mutating) setcol.

@bkamins
Copy link
Member

bkamins commented Jun 24, 2019

I understand we want getcol(::AbstractDataFrame, ::ColumnIndex and setcol!(::AbstractDataFrame, ::ColumnIndex, column). In particular in setcol! I would make column a last argument (like in setproperty! and unlike in setindex!). Then we can deprecate df[col] and df[cols] for getindex.

For setindex! we have the following things we have to think about.

PROBLEM 1 The only consequence is that we should also then remove df[col] .= something and the problem is that setcol! will be not able to do broadcasting when a new column is requested to be created or we want to do broadcasted overwrite of an old column (not broadcast into-it).

PROBLEM 2 The general syntax df[cols] = something. Currently we support:

# df[MultiColumnIndex] = DataFrame
# df[MultiColumnIndex] = AbstractVector (REPEATED FOR EACH COLUMN)
# df[MultiColumnIndex] = Single Item (REPEATED FOR EACH COLUMN; EXPANDS TO NROW(df) if NCOL(df) > 0)

Again broadcasting will handle all these via df[:, cols] .= these_values, but the problem is when the user either:

  • wants to create a new column this way (not a big problem)
  • wants to overwrite an old column with a new column (not to broadcast into-it)

@bkamins
Copy link
Member

bkamins commented Jun 24, 2019

I was in a bit of rush, when writing last comment. So to make sure all is clear let me expand on it using an example.

PROBLEM 1

df = DataFrame(x = 1:4)
df[:y] .= 1 # how to achieve this with `setcol!` only is problematic; ; you have to use fill or something similar
df[:, :x] .= "aaa" # this will fail, as we try to write in-place to a vector of integers
df[:x] = "aaa" # this overwrites column :x, again with `setcol!` only achieving this is problematic; you have to use fill or something similar

PROBLEM 2

df = DataFrame(x = 1:4)
df[[:y]] = 1:4 # it is problematic how to achieve this 
df[:, [:x]] .= 'a':'d' # this will work but converts Char to Int - not what one wanted probably in most of the cases
df[[:x]] = 'a':'d' # again - do we want to support this?

So in summary the problems are:

  • new column creation
  • replacing of old columns (as opposed of writing into them which is not problematic)

@oxinabox
Copy link
Contributor Author

oxinabox commented Jun 24, 2019

df[:, :x] and df[:, [:x]],
are out of scope for setcol! and getcol
Those are dataframe slice indexing, which just so happen to refer to a single column.

df[:, [:x]] is particluarly out of scope since it doesn't even refer to a Vector for getindex.

df[:y] .= 1 is actually not a setcol assuming it mutates the contents of the vector
It is a getcol(df, y) .= 1

df[[:x]] feels pretty weird, not sure what people would be doing with it in practice

df[:x] = "aaa" doing df[:x] = fill("aaa", nrows(df)) is not nesc behavour we want to keep.
It is a bit confusing as a kind of "broadcast without dot".
I could see the argument if we had any similar behavour with matrixes,
but e.g. [[1 2; 3 4] 0] does not add a column of zeros.
For that one needs [[1 2; 3 4] zeros(2)]
So it is probably ok that we also require the external use of fill.
Setting a whole column is not super common anyway (This is not to say it never occurs, jut that it is not super common. It definately does occur, when you plan to mutate it later)

df[:, :x] .= "aaa" # this will fail, as we try to write in-place to a vector of integers

That is a feature.

df[:, [:x]] .= 'a':'d' # this will work but converts Char to Int - not what one wanted probably in most of the cases

Again seems like a feature,
or at very least special casing this somehow would put us out of sync with the rest of the language.

@bkamins
Copy link
Member

bkamins commented Jun 24, 2019

I have EDITS below to reflect recent discussions on the whole design of indexing.
The general idea is that df.col and df[:, col] should be synonyms as in the case of getting columns.

@oxinabox I agree (I think 😄). I was writing using current not target specification. Also note that we have to design a consistent system that supports all cases i.e. single or multiple rows and columns that is why I have included some things that are out of scope of this specific issue, but are related.

In target parlance we could have the following approach:

Assumption. We want to get rid of:

  • df[col] = value disallowed
  • df[cols] = value diallowed
  • df[col] .= value disallowed
  • df[cols] .= value diallowed

Single column setting:

  • df.col .= value, df.col = vector df[:, col] .= value and df[:, col] = vector to replace an existing column with a freshly allocated column or create a new column (except for df.col .= value which cannot create columns)
  • setcol!(df, col, value; copycols=false) replace or add a single column (with no-copying by default)
  • df[rows, col] = vector, df[rows, col] .= value, getcol(df, col)[:] = vector and getcol(df, col)[:] .= scalar to change the column in-place (we essentially drop down to a single column - this is a bit messy, but I think this operation is not needed often given the other options we have)

The key issue is that we start making a difference that rows is not : but something else that select rows. : is special as we define df.col and df[:, col] to be synonyms. In general : leads to column replacement by freshly allocated columns and rows is an in-place operation.

Now the rules extend to the following cases for multiple columns:

  • df[:, cols] .= value and df[:, cols] = value - replace or create columns with freshly allocated columns (I would have to implement it for df[:, cols] .= value as column creation for multiple columns was not implemented earlier)
  • df[rows, cols] .= value and df[rows, cols] = value - in place operation
  • to push multiple columns with no-copying by default `setcol!(df, col, value; copycols=false) would have to be used repeatedly (this is not convenient, but I think we do not have to be convenient here)

@nalimilan
Copy link
Member

create a new column filled with fixed values is achieved by df.x = fill(1, nrow(df)); this is a bit problematic as it copies the column by default; in order to avoid copying you need to write setcol!(df, x, fill(1, nrow(df)), copycols=false); as noted above we can do it via df[:, :x] .= 1 and create column :x if we decide to support such operation; (note that fill is not always an appropriate way to create this column unfortunately, e.g. when you would want a categorical array)

I'm fine with df[:, :x] .= 1 allocating a new column if it doesn't exist. That's a bit inconsistent since that syntax would be in-place if the column already exists, but that's not the end of the world (since otherwise that would be an error anyway).

We should also make it easier to create a categorical array filled with a value, e.g. with something like fill(catvalue(...), ...) (catvalue would need a better name, maybe it could just be a special constructor for CategoricalValue).

replace (create a new vector) an existing column filled with fixed values; the same as above, except that we cannot use df[:, :x] .= 1 even if we decide to support it for the case above

That's the most problematic case. Maybe we can try without any special support for that, assuming that in most cases .= will be enough. If that turns out not to be the case, we could reintroduce support for df.col = value without breaking anything.

replace multiple columns in a data frame (you would have to write an iteration that does it); in-place replacement is of course achievable by df[:, cols] .= value.

It sounds OK not to support this.

@bkamins
Copy link
Member

bkamins commented Jun 26, 2019

I have updated the proposed rules in the post above given the discussion related also to getcol in the other issue.

@nalimilan
Copy link
Member

After discussing this again, it appears that it's difficult to get all the behaviors we want with only df[rows, col] = ... and df[rows, col] .= ... without special-casing : and/or being slightly inconsistent with how matrices work:

  • df[rows, col] = ... has to be in-place when rows is a vector of indices, but when it's : we would want it to allow adding/replacing columns.
  • df[rows, col] .= ... should also be in place for consistency with matrices and for efficiency, but then it's slightly weird that it also allocates a new column when ... is a scalar and col doesn't exist yet.

So it doesn't look like = and .= can offer all the combinations we need between in-place vs replacing and broadcasting vs not broadcasting: the existence of df[col] is useful to provide that additional flexibility.

An interesting rule to follow could be that df[rows, col] treats data frames like two-dimensional collections or matrices: df[rows, col] = ... and df[rows, col] .= ... wouldn't be allowed to resize nor reallocate columns in df. Another syntax, operating on a whole column, would allow doing that. If we don't want to keep df[col], df[!, col] (which is valid syntax) or something similar would be a possibility. It could also allow accessing the column without making a copy, with ! indicating that mutating the vector will mutate the parent data frame.

@bkamins
Copy link
Member

bkamins commented Jun 26, 2019

Just to clarify one thing. Do we want df[!, col] = vector to copy vector on assignment (I understand yes, then we still need setcol! to have a non-copying variant)

@bkamins
Copy link
Member

bkamins commented Jun 26, 2019

In order to wrap around my thinking about this problem I have written down what can be done with DataFrame and single column indexing. (pending are SubDataFrame and multi-column indexing, but this is complex enough). I have written what is my thinking that we could do given comments of @nalimilan, @oxinabox and @quinnj and what is the current behavior for the reference. Feel free to add your columns with your preferred syntax:

# operation @bkamins @nalimilan @bkamins 2 current
1 get a column df.col, df[!, col] df.col, df[!, col] df.col, df[!, col] df.col, df[col]
2 get a copy of a column df[:, col] df[:, col] df[:, col] same
3 get a subset of a column df[rows, col] df[rows, col] df[rows, col] same
4 get a view of a column view(df, :, col); view(df, !, col) view(df, :, col); view(df, !, col) view(df, :, col); view(df, !, col) same
3 get a view of a subset of a column view(df, rows, col) view(df, rows, col) view(df, rows, col) same
5 get a value from a column df[row, col] df[row, col] df[row, col] same
6 get a view of a value from a column view(df, row, col) view(df, row, col) view(df, row, col) same
7 set a non-existing column with vector copying df.col = vector; df[:, col] = vector; df[:, col] .= vector, setcol!(df, col, vector, copycols=true) df.col = vector; df[:, col] = vector; df[:, col] .= vector, setcol!(df, col, vector, copycols=true) df.col = vector; df[:, col] = vector; df[:, col] .= vector; df[!, col] .= vector df[col] .= vector
8 set a non-existing column with vector non-copying df[!, col] = vector; setcol!(df, col, vector, copycols=false) df[!, col] = vector; setcol!(df, col, vector, copycols=false) df[!, col] = vector df.col = vector; df[col] = vector; df[:, col] = vector
9 set a non-existing column with scalar df[:, col] .= scalar; df[!, col] .= scalar df[:, col] .= scalar; df[!, col] .= scalar (*) df[:, col] .= scalar; df[!, col] .= scalar df[col] = scalar; df[:, col] = scalar; df[:, col] .= scalar; df[!, col] .= scalar
10 set an existing column with vector replace copying df.col = vector; setcol!(df, col, vector, copycols=true) df.col = vector; df[:, col] = vector; df[:, col] .= vector (*); setcol!(df, col, vector, copycols=true) df.col = vector; df[:, col] = vector; df[:, col] .= vector x
11 set an existing column with vector replace non-copying df[!, col] = vector; setcol!(df, col, vector, copycols=false) df[!, col] = vector; setcol!(df, col, vector, copycols=false) df[!, col] = vector df[col] = vector; df[:, col] = vector; df.col = vector
12 set an existing column with vector in-place df[:, col] .= vector; df[:, col] = vector; df.col[:] = vector; df.col .= vector df[!, col] .= vector (*); df.col[:] = vector; df.col .= vector df[!, col] .= vector; df.col[:] = vector; df.col .= vector df[col] .= vector; df[:, col] .= vector
13 set an existing column with scalar replace df[!, col] .= scalar df[:, col] .= scalar df[:, col] .= scalar df[:, col] = scalar; df[col] = scalar; df.col = scalar
14 set an existing column with scalar in-place df[:, col] .= scalar; df.col .= scalar df[!, col] .= scalar; df.col .= scalar df[!, col] .= scalar; df.col .= scalar df[col] .= scalar; df.col .= scalar; df[:, col] .= scalar
15 set an existing column subset with vector in-place df[rows, col] = vector; df[rows, col] .= vector df[rows, col] = vector; df[rows, col] .= vector df[rows, col] = vector; df[rows, col] .= vector same
16 set an existing column subset scalar in-place df[rows, col] .= scalar df[rows, col] .= scalar df[rows, col] .= scalar df[rows, col] .= scalar; df[rows, col] = scalar

(*) could also be an error as another syntax exists for this operation and may not completely consistent with other uses.

With "@bkamins 2" option we can avoid having to add setcol! function. Also in this scenario we could disallow df[!, col] in broadcasting with little harm.

@pdeffebach
Copy link
Contributor

As the person who has been making a big fuss about the deprecation of df[col] = x I really like df[!, col] and df[:, col].

@bkamins
Copy link
Member

bkamins commented Jun 28, 2019

OK - it seems we are converging with @nalimilan with our proposals. Simplifying: df[!, col] is a drop in replacement for old df[col].

The only decisions are:

  • do we want to allow df[!, col] to be allowed in broadcasting (here the issue is mainly simplicity of API, but it is doable); EDIT if we decide that df[!, col] is essentially the same as old df[col] then defining broadcasting over it is not that problematic
  • there is a slight inconvenience that df.col is the same as df[!, col] for getindex type of action and like df[:, col] for setindex! type of action; this is a bit inconsistent, but follows what was voiced by @oxinabox (setindex! behavior) and @quinnj (getindex behavior). It would be better not to have this duality as it will be confusing for people (so probably we should decide on the one or the other).

@bkamins
Copy link
Member

bkamins commented Jul 12, 2019

After discussions #1866 takes an approach that setproperty is equivalent to df[!, col] = v which DOES NOT copy a vector.

The general rule (that unfortunately has to be learned by users) is that single column operations do not copy as this is typically what is expected (although admittedly this is slightly unsafe).

@bkamins
Copy link
Member

bkamins commented Jul 25, 2019

I am closing this. Please reopen if you feel this needs to be discussed (but I feel we are settled that we do not implement this request).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants