Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mergeduplicates keyword to handle makeunique=false #3366

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
110 changes: 68 additions & 42 deletions src/abstractdataframe/abstractdataframe.jl
Original file line number Diff line number Diff line change
Expand Up @@ -1536,14 +1536,18 @@ end

"""
hcat(df::AbstractDataFrame...;
makeunique::Bool=false, copycols::Bool=true)
makeunique::Bool=false, mergeduplicates=nothing, copycols::Bool=true)

Horizontally concatenate data frames.

If `makeunique=false` (the default) column names of passed objects must be unique.
If `makeunique=true` then duplicate column names will be suffixed
with `_i` (`i` starting at 1 for the first duplicate).

If `makeunique=false` and `mergeduplicates` is a Function then duplicate column names
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
If `makeunique=false` and `mergeduplicates` is a Function then duplicate column names
If `makeunique=false` and `mergeduplicates` is a `Function` then duplicate column names

will be combined by this function with the column named overwritten by the results of
the function on all values from the duplicated column(s).
Comment on lines +1548 to +1549
Copy link
Member

@bkamins bkamins Sep 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not clear what this sentence means. Probably a native speaker would be better to say how to fix it.

What I can say is how this will work mergeduplicates(mergeduplicates(x,y),z) where x, y and z are consecutive duplicates.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I say "ouch". I am very much a native English speaker. That said I've struggled with this wording to make it clearer... more struggle then.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the following be clearer?

If makeunique=false and mergeduplicates is a Function then duplicate columns
will be combined by invoking the function with all values from those columns.
e.g. mergeduplicates=coalesce will use the first non-missing value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The crucial issue is, as I have commented, that the behavior or mergeduplicates will differ depending on the function in which it is used (unless we change the current implementation).

In functions like insertcols or DataFrame it will pass all duplicate columns as consecutive arguments to a SINGLE CALL to mergeduplicates function.

In functions like hcat or *join it will perform a recursive call taking at most two duplicate columns at a time.


If `copycols=true` (the default) then the `DataFrame` returned by `hcat` will
contain copied columns from the source data frames.
If `copycols=false` then it will contain columns as they are stored in the
Expand Down Expand Up @@ -1587,32 +1591,32 @@ julia> df3 = hcat(df1, df2, makeunique=true)
julia> df3.A === df1.A
false

julia> df3 = hcat(df1, df2, makeunique=true, copycols=false);
julia> df3 = hcat(df1, df2, mergeduplicates=:makeunique, copycols=false);
leei marked this conversation as resolved.
Show resolved Hide resolved

julia> df3.A === df1.A
true
```
"""
function Base.hcat(df::AbstractDataFrame; makeunique::Bool=false, copycols::Bool=true)
function Base.hcat(df::AbstractDataFrame; makeunique::Bool=false, mergeduplicates=nothing, copycols::Bool=true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
function Base.hcat(df::AbstractDataFrame; makeunique::Bool=false, mergeduplicates=nothing, copycols::Bool=true)
function Base.hcat(df::AbstractDataFrame; makeunique::Bool=false, mergeduplicates::Union{Nothing, Function}=nothing, copycols::Bool=true)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm taking it that you're suggesting that mergeduplicates be typed everywhere?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

df = DataFrame(df, copycols=copycols)
_drop_all_nonnote_metadata!(df)
return df
end

# TODO: after deprecation remove AbstractVector methods
Base.hcat(df::AbstractDataFrame, x::AbstractVector; makeunique::Bool=false, copycols::Bool=true) =
hcat!(DataFrame(df, copycols=copycols), x, makeunique=makeunique, copycols=copycols)
Base.hcat(x::AbstractVector, df::AbstractDataFrame; makeunique::Bool=false, copycols::Bool=true) =
hcat!(x, df, makeunique=makeunique, copycols=copycols)
Base.hcat(df::AbstractDataFrame, x::AbstractVector; makeunique::Bool=false, mergeduplicates=nothing, copycols::Bool=true) =
hcat!(DataFrame(df, copycols=copycols), x, makeunique=makeunique, mergeduplicates=mergeduplicates, copycols=copycols)
Base.hcat(x::AbstractVector, df::AbstractDataFrame; makeunique::Bool=false, mergeduplicates=nothing, copycols::Bool=true) =
hcat!(x, df, makeunique=makeunique, mergeduplicates=mergeduplicates, copycols=copycols)
Base.hcat(df1::AbstractDataFrame, df2::AbstractDataFrame;
makeunique::Bool=false, copycols::Bool=true) =
makeunique::Bool=false, mergeduplicates=nothing, copycols::Bool=true) =
hcat!(DataFrame(df1, copycols=copycols), df2,
makeunique=makeunique, copycols=copycols)
makeunique=makeunique, mergeduplicates=mergeduplicates, copycols=copycols)
Base.hcat(df::AbstractDataFrame, x::Union{AbstractVector, AbstractDataFrame},
y::Union{AbstractVector, AbstractDataFrame}...;
makeunique::Bool=false, copycols::Bool=true) =
hcat!(hcat(df, x, makeunique=makeunique, copycols=copycols), y...,
makeunique=makeunique, copycols=copycols)
makeunique::Bool=false, mergeduplicates=nothing, copycols::Bool=true) =
hcat!(hcat(df, x, makeunique=makeunique, mergeduplicates=mergeduplicates, copycols=copycols), y...,
makeunique=makeunique, mergeduplicates=mergeduplicates, copycols=copycols)

"""
vcat(dfs::AbstractDataFrame...;
Expand Down Expand Up @@ -2868,8 +2872,11 @@ const INSERTCOLS_ARGUMENTS =
are unwrapped and treated in the same way
- `after` : if `true` columns are inserted after `col`
- `makeunique` : defines what to do if `name` already exists in `df`;
if it is `false` an error will be thrown; if it is `true` a new unique name will
be generated by adding a suffix
if it is `true` a new unique name will be generated by adding a suffix,
if it is `false` an error will be thrown unless a `mergeduplicates` functiom is provided.
- `mergeduplicates` : defines what to do if `name` already exists in `df` and `makeunique`
is false. It should be given a Function that combines the values of all of the duplicated
columns which will be passed as a varargs. The return value is used.
- `copycols` : whether vectors passed as columns should be copied

If `val` is an `AbstractRange` then the result of `collect(val)` is inserted.
Expand All @@ -2891,7 +2898,7 @@ const INSERTCOLS_ARGUMENTS =

"""
insertcols(df::AbstractDataFrame[, col], (name=>val)::Pair...;
after::Bool=false, makeunique::Bool=false, copycols::Bool=true)
after::Bool=false, makeunique::Bool=false, mergeduplicates=nothing, copycols::Bool=true)

Insert a column into a copy of `df` data frame using the [`insertcols!`](@ref)
function and return the newly created data frame.
Expand Down Expand Up @@ -2922,7 +2929,7 @@ julia> insertcols(df, 1, :b => 'a':'c')
2 │ b 2
3 │ c 3

julia> insertcols(df, :c => 2:4, :c => 3:5, makeunique=true)
julia> insertcols(df, :c => 2:4, :c => 3:5, mergeduplicates=nothing)
leei marked this conversation as resolved.
Show resolved Hide resolved
3×3 DataFrame
Row │ a c c_1
│ Int64 Int64 Int64
Expand All @@ -2942,13 +2949,13 @@ julia> insertcols(df, :a, :d => 7:9, after=true)
```
"""
insertcols(df::AbstractDataFrame, args...;
after::Bool=false, makeunique::Bool=false, copycols::Bool=true) =
after::Bool=false, makeunique::Bool=false, mergeduplicates=nothing, copycols::Bool=true) =
insertcols!(copy(df), args...;
after=after, makeunique=makeunique, copycols=copycols)
after=after, makeunique=makeunique, mergeduplicates=mergeduplicates, copycols=copycols)

"""
insertcols!(df::AbstractDataFrame[, col], (name=>val)::Pair...;
after::Bool=false, makeunique::Bool=false, copycols::Bool=true)
after::Bool=false, makeunique::Bool=false, mergeduplicates=nothing, copycols::Bool=true)

Insert a column into a data frame in place. Return the updated data frame.

Expand Down Expand Up @@ -2979,7 +2986,7 @@ julia> insertcols!(df, 1, :b => 'a':'c')
2 │ b 2
3 │ c 3

julia> insertcols!(df, 2, :c => 2:4, :c => 3:5, makeunique=true)
julia> insertcols!(df, 2, :c => 2:4, :c => 3:5, mergeduplicates=nothing)
3×4 DataFrame
Row │ b c c_1 a
│ Char Int64 Int64 Int64
Expand All @@ -2999,7 +3006,10 @@ julia> insertcols!(df, :b, :d => 7:9, after=true)
```
"""
function insertcols!(df::AbstractDataFrame, col::ColumnIndex, name_cols::Pair{Symbol}...;
after::Bool=false, makeunique::Bool=false, copycols::Bool=true)
after::Bool=false, makeunique::Bool=false, mergeduplicates=nothing, copycols::Bool=true)

_check_makeunique_args(mergeduplicates, makeunique)

if !is_column_insertion_allowed(df)
throw(ArgumentError("insertcols! is only supported for DataFrame, or for " *
"SubDataFrame created with `:` as column selector"))
Expand All @@ -3025,15 +3035,15 @@ function insertcols!(df::AbstractDataFrame, col::ColumnIndex, name_cols::Pair{Sy
"$(ncol(df)) columns at index $col_ind"))
end

if !makeunique
if !makeunique && isnothing(mergeduplicates)
if !allunique(first.(name_cols))
throw(ArgumentError("Names of columns to be inserted into a data frame " *
"must be unique when `makeunique=true`"))
"must be unique when `mergeduplicates=nothing`"))
end
for (n, _) in name_cols
if hasproperty(df, n)
throw(ArgumentError("Column $n is already present in the data frame " *
"which is not allowed when `makeunique=true`"))
"which is not allowed when `mergeduplicates=nothing`"))
end
end
end
Expand Down Expand Up @@ -3067,6 +3077,7 @@ function insertcols!(df::AbstractDataFrame, col::ColumnIndex, name_cols::Pair{Sy
target_row_count = 1
end

mergecolumns = Dict{Symbol, Any}()
start_col_ind = col_ind
for (name, item) in name_cols
if !(item isa AbstractVector)
Expand Down Expand Up @@ -3103,23 +3114,38 @@ function insertcols!(df::AbstractDataFrame, col::ColumnIndex, name_cols::Pair{Sy
dfp[!, name] = item_new
else
if hasproperty(dfp, name)
@assert makeunique
k = 1
while true
nn = Symbol("$(name)_$k")
if !hasproperty(dfp, nn)
name = nn
break
if makeunique
k = 1
while true
nn = Symbol("$(name)_$k")
if !hasproperty(dfp, nn)
name = nn
break
end
k += 1
end
k += 1
insert!(index(dfp), col_ind, name)
insert!(_columns(dfp), col_ind, item_new)
else
# Just update without adding to index
merge = get(mergecolumns, name, (dfp=dfp, cols=[]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you need dfp for here?

push!(merge.cols, item_new)
mergecolumns[name] = merge
col_ind -= 1
end
else
insert!(index(dfp), col_ind, name)
insert!(_columns(dfp), col_ind, item_new)
end
insert!(index(dfp), col_ind, name)
insert!(_columns(dfp), col_ind, item_new)
end
col_ind += 1
end

# Combine columns using mergeduplicates
for (name, merge) in mergecolumns
merge.dfp[!, name] = mergeduplicates.(merge.dfp[!, name], merge.cols...)
end

delta = col_ind - start_col_ind
colmetadata_dict = getfield(parent(df), :colmetadata)
if !isnothing(colmetadata_dict) && delta > 0
Expand All @@ -3134,22 +3160,22 @@ function insertcols!(df::AbstractDataFrame, col::ColumnIndex, name_cols::Pair{Sy
end

insertcols!(df::AbstractDataFrame, col::ColumnIndex, name_cols::Pair{<:AbstractString}...;
after::Bool=false, makeunique::Bool=false, copycols::Bool=true) =
after::Bool=false, makeunique::Bool=false, mergeduplicates=nothing, copycols::Bool=true) =
insertcols!(df, col, (Symbol(n) => v for (n, v) in name_cols)...,
after=after, makeunique=makeunique, copycols=copycols)
after=after, makeunique=makeunique, mergeduplicates=mergeduplicates, copycols=copycols)

insertcols!(df::AbstractDataFrame, name_cols::Pair{Symbol}...;
after::Bool=false, makeunique::Bool=false, copycols::Bool=true) =
after::Bool=false, makeunique::Bool=false, mergeduplicates=nothing, copycols::Bool=true) =
insertcols!(df, ncol(df)+1, name_cols..., after=after,
makeunique=makeunique, copycols=copycols)
makeunique=makeunique, mergeduplicates=mergeduplicates, copycols=copycols)

insertcols!(df::AbstractDataFrame, name_cols::Pair{<:AbstractString}...;
after::Bool=false, makeunique::Bool=false, copycols::Bool=true) =
after::Bool=false, makeunique::Bool=false, mergeduplicates=nothing, copycols::Bool=true) =
insertcols!(df, (Symbol(n) => v for (n, v) in name_cols)...,
after=after, makeunique=makeunique, copycols=copycols)
after=after, makeunique=makeunique, mergeduplicates=mergeduplicates, copycols=copycols)

function insertcols!(df::AbstractDataFrame, col::ColumnIndex; after::Bool=false,
makeunique::Bool=false, copycols::Bool=true)
makeunique::Bool=false, mergeduplicates=nothing, copycols::Bool=true)
if col isa SymbolOrString
col_ind = Int(columnindex(df, col))
if col_ind == 0
Expand All @@ -3173,7 +3199,7 @@ function insertcols!(df::AbstractDataFrame, col::ColumnIndex; after::Bool=false,
end

function insertcols!(df::AbstractDataFrame; after::Bool=false,
makeunique::Bool=false, copycols::Bool=true)
makeunique::Bool=false, mergeduplicates=nothing, copycols::Bool=true)
_drop_all_nonnote_metadata!(parent(df))
return df
end
Expand Down
14 changes: 7 additions & 7 deletions src/abstractdataframe/reshape.jl
Original file line number Diff line number Diff line change
Expand Up @@ -823,7 +823,7 @@ julia> permutedims(df2, 1, "different_name")
"""
function Base.permutedims(df::AbstractDataFrame, src_namescol::ColumnIndex,
dest_namescol::Union{Symbol, AbstractString};
makeunique::Bool=false, strict::Bool=true)
makeunique::Bool=false, mergeduplicates=nothing, strict::Bool=true)

if src_namescol isa Integer
1 <= src_namescol <= ncol(df) || throw(BoundsError(index(df), src_namescol))
Expand Down Expand Up @@ -854,26 +854,26 @@ function Base.permutedims(df::AbstractDataFrame, src_namescol::ColumnIndex,

if ncol(df_notsrc) == 0
df_tmp = DataFrame(AbstractVector[[] for _ in 1:nrow(df)], new_col_names,
makeunique=makeunique, copycols=false)
makeunique=makeunique, mergeduplicates=mergeduplicates, copycols=false)
else
m = permutedims(Matrix(df_notsrc))
df_tmp = rename!(DataFrame(Tables.table(m)), new_col_names, makeunique=makeunique)
end
out_df = hcat!(df_permuted, df_tmp, makeunique=makeunique, copycols=false)
out_df = hcat!(df_permuted, df_tmp, makeunique=makeunique, mergeduplicates=mergeduplicates, copycols=false)
_copy_table_note_metadata!(out_df, df)
return out_df
end

function Base.permutedims(df::AbstractDataFrame, src_namescol::ColumnIndex;
makeunique::Bool=false, strict::Bool=true)
makeunique::Bool=false, mergeduplicates=nothing, strict::Bool=true)
if src_namescol isa Integer
1 <= src_namescol <= ncol(df) || throw(BoundsError(index(df), src_namescol))
dest_namescol = _names(df)[src_namescol]
else
dest_namescol = src_namescol
end
return permutedims(df, src_namescol, dest_namescol;
makeunique=makeunique, strict=strict)
makeunique=makeunique, mergeduplicates=mergeduplicates, strict=strict)
end

function Base.permutedims(df::AbstractDataFrame)
Expand All @@ -883,8 +883,8 @@ function Base.permutedims(df::AbstractDataFrame)
end

function Base.permutedims(df::AbstractDataFrame, cnames::AbstractVector;
makeunique::Bool=false)
out_df = DataFrame(permutedims(Matrix(df)), cnames, makeunique=makeunique)
makeunique::Bool=false, mergeduplicates=nothing)
out_df = DataFrame(permutedims(Matrix(df)), cnames, makeunique=makeunique, mergeduplicates=mergeduplicates)
_copy_table_note_metadata!(out_df, df)
return out_df
end
Loading