-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sim_mutations and garbage collection #1946
Comments
There shouldn't be any interaction between |
Are you getting the same result @mufernando? Would be interesting of your system is accumulating for this example. |
I ran the same example (with a msprime tree sequence) and didn't run into issues. Then I ran a minimal example with three Lastly, I added some manual garbage collection in between With some number of objects being accumulated in between calls:
This was my script:
and I can share the tree sequence, but it's 250M. |
ping @molpopgen? |
Well, let me try to break a tie and see if I make a plot that looks more like @jeromekelleher or more like @mufernando. |
Very nice, thanks @mufernando |
import tskit
import msprime
ts = tskit.load("treefile_even_larger_effect_equilibrium_higmu.trees")
for j in range(5):
mts = msprime.sim_mutations(ts, rate=1e-9, random_seed=j + 1) Tie goes to @mufernando, I think? Did this with a file I'm working with at the moment. |
is this call in Line 1369 in c034a5d
|
The main difference between @jeromekelleher's example and the others seems to be the presence of metadata in the latter, which must be a clue? |
Just looking at your plot again @mufernando - is that really 20s to load a 250mb file? Damn that's slow! |
I agree with Kevin, it's certainly pointing towards the presence of metadata. I guess making a smaller reproducible example is the next step. |
Hmm. Yeah. In mine, it is 14 sec for 1.5 gigs. |
Well, adding some top-level metadata doesn't do anything. This example has a flat memory profile: ts = msprime.sim_ancestry(10**4, sequence_length=1000, random_seed=1234)
tables = ts.dump_tables()
tables.metadata_schema = tskit.MetadataSchema.permissive_json()
tables.metadata = {j: list(range(100)) for j in range(100)}
ts = tables.tree_sequence()
for j in range(500):
mts = msprime.sim_mutations(ts, rate=1, random_seed=j + 1) |
These experiments also gave a flat profile: import msprime
import numpy as np
import tskit
ts = msprime.sim_ancestry(10 ** 4, sequence_length=1000, random_seed=1234)
ts = msprime.sim_mutations(ts, rate=1, random_seed=1)
tables = ts.dump_tables()
# This gave a flat profile
# tables.individuals.clear()
# ind = np.where(tables.nodes.individual != tskit.NULL)[0]
#
# tables.individuals.metadata_schema = tskit.MetadataSchema.permissive_json()
# for i, j in zip(ind, ind[1:]):
# tables.individuals.add_row(metadata={"nodes": [int(i), int(j)]})
muts = tskit.MutationTable()
muts.metadata_schema = tskit.MetadataSchema.permissive_json()
for m in tables.mutations:
muts.add_row(
site=m.site,
node=m.node,
derived_state=m.derived_state,
parent=m.parent,
time=m.time,
metadata={"x": [1, 2, 3, 4]},
)
tables.mutations.metadata_schema = muts.metadata_schema
tables.mutations.set_columns(
site=muts.site,
node=muts.node,
derived_state=muts.derived_state,
derived_state_offset=muts.derived_state_offset,
parent=muts.parent,
time=muts.time,
metadata=muts.metadata,
metadata_offset=muts.metadata_offset,
)
# tables.metadata_schema = tskit.MetadataSchema.permissive_json()
# tables.metadata = {j: list(range(100)) for j in range(100)}
ts = tables.tree_sequence()
for j in range(500):
mts = msprime.sim_mutations(ts, rate=1, random_seed=j + 1) |
Note to self to try and use the methods outlined here to try to find reference cycles in the examples leading to growing memory use. |
After changing my script to what is shown below, I got an absolutely massive log of stuff. The log is also large if I skip the mutation step. However, the difference between the 2 is lots of variables called mprof run mutatification.py |grep -c MutationTable I got a count of 30. That seems way too many? The script: import gc
import sys
import tskit
import msprime
gc.set_debug(gc.DEBUG_SAVEALL)
ts = tskit.load("treefile_even_larger_effect_equilibrium_higmu.trees")
for j in range(5):
mts = msprime.sim_mutations(ts, rate=1e-9, random_seed=j + 1)
gc.collect()
for i in gc.garbage:
print(i) |
This version gives a count of 6 mutation tables: import gc
import sys
import tskit
import msprime
gc.set_debug(gc.DEBUG_SAVEALL)
ts = tskit.load("treefile_even_larger_effect_equilibrium_higmu.trees")
gc.collect()
for j in range(1):
mts = msprime.sim_mutations(ts, rate=1e-9, random_seed=j + 1)
del ts
del mts
gc.collect()
for i in gc.garbage:
print(i) |
Hmmmmm.... |
Yeah, my intuition is no good on this. I think this is with tskit 0.3.7 installed, FWIW. |
Thanks, this is very helpful. I'll dig in when I get a chance. Identifying if we have a refcount cycle will be very helpful and motivating in the read-only tables work: tskit-dev/tskit#2057 |
I haven't had luck finding a runtime analysis tool that'll print out reference cycles. Do you know of one? |
No but I figure if gc finds things in simple cases like this that's evidence they exist? |
I believe that's correct. But I want a back trace analog here. I want to see where they're coming from. But I'd also like a unicorn I guess. |
I think this was resolved upstream in tskit |
I'm pretty sure
sim_mutations
is accumulating some garbage same as pyslim (see #pyslim211). Note howsim_mutations
also callsts.tables
.I don't have the bandwidth right now to make a proper test... But it seems this won't matter after the changes to tskit's memory usage goes in, right @benjeffery?
The text was updated successfully, but these errors were encountered: