Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sim_mutations and garbage collection #1946

Closed
mufernando opened this issue Dec 10, 2021 · 26 comments
Closed

sim_mutations and garbage collection #1946

mufernando opened this issue Dec 10, 2021 · 26 comments
Milestone

Comments

@mufernando
Copy link
Member

I'm pretty sure sim_mutations is accumulating some garbage same as pyslim (see #pyslim211). Note how sim_mutations also calls ts.tables.

I don't have the bandwidth right now to make a proper test... But it seems this won't matter after the changes to tskit's memory usage goes in, right @benjeffery?

@jeromekelleher
Copy link
Member

There shouldn't be any interaction between ts.tables and garbage collection @mufernando --- it's just making a copy now where we want it to return a view.

@jeromekelleher
Copy link
Member

Here's an example:

import msprime

ts = msprime.sim_ancestry(10**4, sequence_length=1000, random_seed=1234)

for j in range(500):
    mts = msprime.sim_mutations(ts, rate=1, random_seed=j + 1)

Then,

mprof run sim_mutations_mem.py
mprof plot -o mutations-mem.png

gives

mutations-mem

Looks pretty solid to me?

@benjeffery
Copy link
Member

Are you getting the same result @mufernando? Would be interesting of your system is accumulating for this example.

@mufernando
Copy link
Member Author

I ran the same example (with a msprime tree sequence) and didn't run into issues.

mutations-mem

Then I ran a minimal example with three sim_mutation calls to one of my tree sequences (from SLiM, so with a ton of metadata everywhere). You can see the memory usage only grows over time.

mutations-slim

Lastly, I added some manual garbage collection in between sim_mutation calls and that seems to help quite a bit.

mutations-slim-gc

With some number of objects being accumulated in between calls:

Before: collected 0 objects.
After first call: collected 84 objects.
After second call: collected 84 objects.
After third call: collected 84 objects.

This was my script:

import tskit
import msprime
import gc

ts = tskit.load("../../output/E082AD5JLBXZ4U6/great-apes_E082AD5JLBXZ4U6_rep0.trees")

collected = gc.collect()
print("Before: collected", collected, "objects.", flush=True)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1, keep=False)
collected = gc.collect()
print("After first call: collected", collected, "objects.", flush=True)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=2, keep=False)
collected = gc.collect()
print("After second call: collected", collected, "objects.", flush=True)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=2, keep=False)
collected = gc.collect()
print("After third call: collected", collected, "objects.", flush=True)

and I can share the tree sequence, but it's 250M.

@petrelharp
Copy link
Contributor

ping @molpopgen?

@molpopgen
Copy link
Member

Well, let me try to break a tie and see if I make a plot that looks more like @jeromekelleher or more like @mufernando.

@jeromekelleher
Copy link
Member

Very nice, thanks @mufernando

@molpopgen
Copy link
Member

import tskit
import msprime

ts = tskit.load("treefile_even_larger_effect_equilibrium_higmu.trees")

for j in range(5):
    mts = msprime.sim_mutations(ts, rate=1e-9, random_seed=j + 1)

mutations-mem

Tie goes to @mufernando, I think? Did this with a file I'm working with at the moment.

@molpopgen
Copy link
Member

For completeness, I repeated my example w/gc.collect() after each bout of mutating.

mutations-mem2

@andrewkern
Copy link
Member

is this call in sim_mutations the issue?

lwt.fromdict(tables.asdict())

@molpopgen
Copy link
Member

The main difference between @jeromekelleher's example and the others seems to be the presence of metadata in the latter, which must be a clue?

@jeromekelleher
Copy link
Member

Just looking at your plot again @mufernando - is that really 20s to load a 250mb file? Damn that's slow!

@jeromekelleher
Copy link
Member

I agree with Kevin, it's certainly pointing towards the presence of metadata. I guess making a smaller reproducible example is the next step.

@molpopgen
Copy link
Member

Just looking at your plot again @mufernando - is that really 20s to load a 250mb file? Damn that's slow!

Hmm. Yeah. In mine, it is 14 sec for 1.5 gigs.

@jeromekelleher
Copy link
Member

Well, adding some top-level metadata doesn't do anything. This example has a flat memory profile:

ts = msprime.sim_ancestry(10**4, sequence_length=1000, random_seed=1234)
tables = ts.dump_tables()
tables.metadata_schema = tskit.MetadataSchema.permissive_json()
tables.metadata = {j: list(range(100)) for j in range(100)}
ts = tables.tree_sequence()

for j in range(500):
    mts = msprime.sim_mutations(ts, rate=1, random_seed=j + 1)

@molpopgen
Copy link
Member

These experiments also gave a flat profile:

import msprime
import numpy as np
import tskit

ts = msprime.sim_ancestry(10 ** 4, sequence_length=1000, random_seed=1234)
ts = msprime.sim_mutations(ts, rate=1, random_seed=1)
tables = ts.dump_tables()

# This gave a flat profile
# tables.individuals.clear()
# ind = np.where(tables.nodes.individual != tskit.NULL)[0]
#
# tables.individuals.metadata_schema = tskit.MetadataSchema.permissive_json()
# for i, j in zip(ind, ind[1:]):
#    tables.individuals.add_row(metadata={"nodes": [int(i), int(j)]})

muts = tskit.MutationTable()
muts.metadata_schema = tskit.MetadataSchema.permissive_json()
for m in tables.mutations:
    muts.add_row(
        site=m.site,
        node=m.node,
        derived_state=m.derived_state,
        parent=m.parent,
        time=m.time,
        metadata={"x": [1, 2, 3, 4]},
    )

tables.mutations.metadata_schema = muts.metadata_schema
tables.mutations.set_columns(
    site=muts.site,
    node=muts.node,
    derived_state=muts.derived_state,
    derived_state_offset=muts.derived_state_offset,
    parent=muts.parent,
    time=muts.time,
    metadata=muts.metadata,
    metadata_offset=muts.metadata_offset,
)

# tables.metadata_schema = tskit.MetadataSchema.permissive_json()
# tables.metadata = {j: list(range(100)) for j in range(100)}
ts = tables.tree_sequence()

for j in range(500):
    mts = msprime.sim_mutations(ts, rate=1, random_seed=j + 1)

@molpopgen
Copy link
Member

molpopgen commented Dec 11, 2021

Note to self to try and use the methods outlined here to try to find reference cycles in the examples leading to growing memory use.

@molpopgen
Copy link
Member

After changing my script to what is shown below, I got an absolutely massive log of stuff. The log is also large if I skip the mutation step. However, the difference between the 2 is lots of variables called ll_table, many copies of various mutation, individual, etc., tables. When I modified the script to gc.collect() right after load and then run:

mprof run mutatification.py |grep -c MutationTable

I got a count of 30. That seems way too many?

The script:

import gc
import sys
import tskit
import msprime

gc.set_debug(gc.DEBUG_SAVEALL)

ts = tskit.load("treefile_even_larger_effect_equilibrium_higmu.trees")

for j in range(5):
    mts = msprime.sim_mutations(ts, rate=1e-9, random_seed=j + 1)

gc.collect()
for i in gc.garbage:
    print(i)

@molpopgen
Copy link
Member

molpopgen commented Dec 13, 2021

This version gives a count of 6 mutation tables:

import gc
import sys
import tskit
import msprime

gc.set_debug(gc.DEBUG_SAVEALL)

ts = tskit.load("treefile_even_larger_effect_equilibrium_higmu.trees")

gc.collect()

for j in range(1):
    mts = msprime.sim_mutations(ts, rate=1e-9, random_seed=j + 1)

del ts
del mts

gc.collect()
for i in gc.garbage:
    print(i)

@jeromekelleher
Copy link
Member

Hmmmmm....

@molpopgen
Copy link
Member

Yeah, my intuition is no good on this. I think this is with tskit 0.3.7 installed, FWIW.

@jeromekelleher
Copy link
Member

Thanks, this is very helpful. I'll dig in when I get a chance. Identifying if we have a refcount cycle will be very helpful and motivating in the read-only tables work: tskit-dev/tskit#2057

@molpopgen
Copy link
Member

molpopgen commented Dec 14, 2021

I haven't had luck finding a runtime analysis tool that'll print out reference cycles. Do you know of one?

@jeromekelleher
Copy link
Member

No but I figure if gc finds things in simple cases like this that's evidence they exist?

@molpopgen
Copy link
Member

I believe that's correct. But I want a back trace analog here. I want to see where they're coming from. But I'd also like a unicorn I guess.

@jeromekelleher
Copy link
Member

I think this was resolved upstream in tskit

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants