Join memory usage workaround issues #3355

Closed
jariji opened this issue Jul 8, 2023 · 4 comments
@jariji
Contributor

jariji commented Jul 8, 2023

On large dataframes

innerjoin(left, right; on=cols)

runs out of memory so I end up doing

left.leftrow = 1:nrow(left)
right.rightrow = 1:nrow(right)
ix = innerjoin(left[:, vcat(cols, :leftrow)], right[:, vcat(cols, :rightrow)]; on=cols)
joined = hcat(left[ix.leftrow, :], right[ix.rightrow, Not(cols)])

Two problems arise:

(1) This is slightly inconvenient. It would be nicer if I could do this without all the index hacking.

(2) This doesn't work on leftjoin because ix.rightrow contains missings so right[ix.rightrow, :] fails.

I wonder what you think about these issues.
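For problem (2), one possible way to keep the row-index approach working with leftjoin is to fill the right-hand columns only at the matched positions and leave the rest as missing. This is a minimal sketch on made-up toy tables (the `left`, `right`, and `cols` values here are illustrative assumptions, not the data from this issue), and it is not how DataFrames.jl implements joins internally:

```julia
using DataFrames

# Toy tables, assumed for illustration only.
left = DataFrame(id = [1, 2, 3], a = ["x", "y", "z"])
right = DataFrame(id = [1, 3], b = [10.0, 30.0])
cols = [:id]

left.leftrow = 1:nrow(left)
right.rightrow = 1:nrow(right)

# Join only the key and row-index columns; rightrow is missing for unmatched rows.
ix = leftjoin(left[!, vcat(cols, :leftrow)], right[!, vcat(cols, :rightrow)]; on = cols)

matched = .!ismissing.(ix.rightrow)        # mask of rows that found a match
rows = collect(skipmissing(ix.rightrow))   # matching row numbers in `right`

joined = left[ix.leftrow, :]
for name in setdiff(propertynames(right), cols)
    src = right[!, name]
    # Start with all-missing, then copy values in for the matched rows only.
    dst = Vector{Union{Missing, eltype(src)}}(missing, nrow(ix))
    dst[matched] = src[rows]
    joined[!, name] = dst
end
```

On these toy tables the result should agree with `leftjoin(left, right; on = cols)` up to row order.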

@bkamins
Member

bkamins commented Jul 8, 2023

This is strange (i.e. there must be a memory leak in the join implementation). What you do is conceptually the same as what is done in:
https://github.com/JuliaData/DataFrames.jl/blob/main/src/join/composer.jl#L242
and the later operations should allocate less than hcat.

So for sure it should be fixed. If you have an easy reproducer it would help to track down the issue.

@bkamins bkamins added this to the 1.6 milestone Jul 8, 2023
@jariji
Contributor Author

jariji commented Jul 8, 2023

If this is unexpected then it's possible I'm misreporting. I don't want to hold up any release so I'll close it until I can reproduce.

@jariji jariji closed this as completed Jul 8, 2023
@bkamins
Member

bkamins commented Jul 8, 2023

I ran some quick tests, and plain innerjoin allocated less than your combination (where I replaced : with ! to reduce memory consumption):

julia> @time innerjoin(left, right, on=:id => :id2);
  3.030537 seconds (424 allocations: 1.128 GiB, 39.96% gc time)

julia> @time ix = innerjoin(left[!, vcat(:id, :leftrow)], right[!, vcat(:id2, :rightrow)], on=:id => :id2);
  2.702270 seconds (243 allocations: 1017.748 MiB, 35.03% gc time)

julia> @time hcat(left[ix.leftrow, :], right[ix.rightrow, :]);
  1.067658 seconds (184 allocations: 365.997 MiB, 75.13% gc time)
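The tables behind these timings are not shown in the thread. For anyone who wants to try a similar comparison, here is a rough sketch of a setup; the sizes, column names `x` and `y`, and random keys are assumptions chosen for illustration, so the numbers will differ from the ones above:

```julia
using DataFrames

# Assumed synthetic data; the tables used for the timings above are not shown.
n = 10^5
left = DataFrame(id = rand(1:n, n), x = rand(n))
right = DataFrame(id2 = rand(1:n, n), y = rand(n))
left.leftrow = 1:nrow(left)
right.rightrow = 1:nrow(right)

# Plain join.
direct = innerjoin(left, right, on = :id => :id2)

# Row-index workaround: join the key and index columns, then gather rows.
ix = innerjoin(left[!, [:id, :leftrow]], right[!, [:id2, :rightrow]], on = :id => :id2)
manual = hcat(left[ix.leftrow, :], right[ix.rightrow, Not(:id2)])

# Both routes should produce the same rows, up to ordering.
key = [:leftrow, :rightrow]
same = isequal(sort(direct, key), sort(manual, key))
```

Prefixing each step with `@time` (after one warm-up run, so compilation is not counted) reports allocations in the same form as the output above.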

@jariji
Contributor Author

jariji commented Jul 8, 2023

Sorry for the noise then!
