Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve nonunique #1824

Closed
wants to merge 2 commits into from
Closed

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented May 25, 2019

A small PR with the following changes in nonunique:

  • return BitVector instead of Vector{Bool}
  • correctly handle passing true as cols (it should error)
  • make the operation faster for DataFrame (as in the old implementation we were performing an unnecessary copy)
  • improve code coverage

@inbounds for g_row in gslots
(g_row > 0) && (res[g_row] = false)
end
return res
end

nonunique(df::AbstractDataFrame, cols::Union{Integer, Symbol}) = nonunique(df[[cols]])
nonunique(df::AbstractDataFrame, cols::Any) = nonunique(df[cols])
nonunique(df::AbstractDataFrame, cols) =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old approach looked more Julian to me (use dispatch). Also, better use long function form for multi-line functions.

@@ -1567,3 +1567,5 @@ end
function permutecols!(df::DataFrame, p::AbstractVector{Symbol})
permutecols!(df, index(df)[p])
end

nonunique(df::DataFrame, cols) = nonunique(select(df, cols, copycols=false))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe better implement select for all existing AbstractDataFrame types? Then we wouldn't need this special case.

@@ -871,15 +871,22 @@ function nonunique(df::AbstractDataFrame)
gslots = row_group_slots(ntuple(i -> df[i], ncol(df)), Val(true))[3]
# unique rows are the first encountered group representatives,
# nonunique are everything else
res = fill(true, nrow(df))
res = trues(nrow(df))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that a bit slower? There may be a speed-memory tradeoff, and in general we tend to favor speed since we assume the data frame stores many columns so it doesn't really matter to allocate another one temporarily.

@bkamins
Copy link
Member Author

bkamins commented Jun 6, 2019

TODO note:
also completecases needs fixing as now it does an unnecessary copy internally.

This will be added after Not is added and select is implemented for all AbstractDataFrame types.

@bkamins
Copy link
Member Author

bkamins commented Jun 10, 2019

I will close this once I move all functionality to #1847

@bkamins
Copy link
Member Author

bkamins commented Jun 16, 2019

This PR is now redundant. It is covered by #1847.

@bkamins bkamins closed this Jun 16, 2019
@bkamins bkamins deleted the avoid_copying_dataframe branch June 16, 2019 07:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants