Improve nonunique #1824

bkamins · 2019-05-25T17:31:41Z

A small PR with the following changes in nonunique:

- return a BitVector instead of a Vector{Bool}
- correctly handle passing true as cols (it should error)
- make the operation faster for DataFrame (the old implementation performed an unnecessary copy)
- improve code coverage
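
For readers unfamiliar with the function, here is a minimal sketch of the semantics being discussed: every row that repeats an earlier row is flagged. It uses only Base Julia; `nonunique_sketch` and the tuple-based row representation are hypothetical stand-ins for illustration, not the DataFrames.jl implementation (which works on hashed row groups, as the diff further down shows).

```julia
# Hypothetical helper for illustration only; rows are plain tuples here.
function nonunique_sketch(rows::AbstractVector)
    seen = Set{eltype(rows)}()
    res = falses(length(rows))      # BitVector, initially all false
    for (i, row) in enumerate(rows)
        if row in seen
            res[i] = true           # a repeat of a row seen earlier
        else
            push!(seen, row)        # first occurrence stays unflagged
        end
    end
    return res
end

rows = [(1, "a"), (2, "b"), (1, "a"), (2, "c")]
flags = nonunique_sketch(rows)      # only the third row repeats an earlier one
```

The sketch starts from all-false and sets duplicates to true, while the code in the PR starts from all-true and clears the group representatives; the resulting flags are meant to be the same.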

nalimilan · 2019-05-31T20:57:58Z

src/abstractdataframe/abstractdataframe.jl

    @inbounds for g_row in gslots
        (g_row > 0) && (res[g_row] = false)
    end
    return res
 end

-nonunique(df::AbstractDataFrame, cols::Union{Integer, Symbol}) = nonunique(df[[cols]])
-nonunique(df::AbstractDataFrame, cols::Any) = nonunique(df[cols])
+nonunique(df::AbstractDataFrame, cols) =


The old approach looked more Julian to me (it uses dispatch). Also, it is better to use the long function form for multi-line functions.
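
To make the two points in this comment concrete, here is a small Base-only sketch of dispatch-based argument handling written in long function form. `pick` is a hypothetical stand-in, not a DataFrames.jl function, and the explicit `Bool` method is only one possible way to reject `true`; it is not claimed to be what the PR does. It relies on the fact that `Bool` is a subtype of `Integer` in Julia, so a method annotated with `Integer` also accepts `true` unless a more specific method exists.

```julia
# Hypothetical stand-ins; `pick` is not part of DataFrames.jl.
# One method per kind of argument, each in long function form.
function pick(cols::Bool)
    # Bool <: Integer, so without this method `true` would hit the method below
    throw(ArgumentError("a single Bool is not a valid column selector"))
end

function pick(cols::Union{Integer, Symbol})
    return [cols]          # wrap a single selector in a vector
end

function pick(cols)
    return cols            # fallback: anything else is passed through unchanged
end

single = pick(:a)          # dispatches to the Union{Integer, Symbol} method
many = pick([:a, :b])      # dispatches to the untyped fallback
```

Calling `pick(true)` selects the `Bool` method, because it is the most specific match, and throws an `ArgumentError`.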

nalimilan · 2019-05-31T21:01:41Z

src/dataframe/dataframe.jl

@@ -1567,3 +1567,5 @@ end
 function permutecols!(df::DataFrame, p::AbstractVector{Symbol})
    permutecols!(df, index(df)[p])
 end
+
+nonunique(df::DataFrame, cols) = nonunique(select(df, cols, copycols=false))


Maybe it would be better to implement select for all existing AbstractDataFrame types? Then we wouldn't need this special case.

nalimilan · 2019-05-31T21:03:10Z

src/abstractdataframe/abstractdataframe.jl

@@ -871,15 +871,22 @@ function nonunique(df::AbstractDataFrame)
    gslots = row_group_slots(ntuple(i -> df[i], ncol(df)), Val(true))[3]
    # unique rows are the first encountered group representatives,
    # nonunique are everything else
-    res = fill(true, nrow(df))
+    res = trues(nrow(df))


Isn't that a bit slower? There may be a speed-memory tradeoff, and in general we tend to favor speed: we assume the data frame stores many columns, so temporarily allocating one more doesn't really matter.
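
The memory side of the tradeoff raised here can be checked with Base Julia alone. The sketch below is illustrative, not a benchmark: `trues(n)` builds a bit-packed `BitVector`, `fill(true, n)` builds a `Vector{Bool}` with one byte per element, and `Base.summarysize` (an unexported Base function) reports the approximate memory each one occupies. The size `n` and the indices written to are arbitrary choices for the example. Whether the packed form is actually slower for this write pattern would need to be timed separately.

```julia
n = 1_000_000
packed = trues(n)        # BitVector: bits packed into 64-bit chunks
plain = fill(true, n)    # Vector{Bool}: one byte per element

# Same indexing API for both; the BitVector has to mask bits on each write.
for i in (2, 5, 7)
    packed[i] = false
    plain[i] = false
end

same_contents = packed == plain              # element-wise equality across types
packed_bytes = Base.summarysize(packed)      # roughly n / 8 bytes plus overhead
plain_bytes = Base.summarysize(plain)        # roughly n bytes plus overhead
```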

bkamins · 2019-06-06T04:27:39Z

TODO note:
completecases also needs fixing, as it currently makes an unnecessary copy internally.

This will be added after Not is added and select is implemented for all AbstractDataFrame types.

bkamins · 2019-06-10T22:13:42Z

I will close this once I move all functionality to #1847.

bkamins · 2019-06-16T07:48:20Z

This PR is now redundant. It is covered by #1847.

bkamins added 2 commits May 25, 2019 19:25

improve nonunique

e1015ae

fix forgotten signature cleanup

8e1be16

nalimilan reviewed May 31, 2019

nalimilan mentioned this pull request May 31, 2019

select and deletecols for SubDataFrame and DataFrameRow #1825

Closed

bkamins closed this Jun 16, 2019

bkamins deleted the avoid_copying_dataframe branch June 16, 2019 07:48
