sim_mutations and garbage collection #1946

mufernando · 2021-12-10T06:29:03Z

I'm pretty sure sim_mutations is accumulating some garbage same as pyslim (see #pyslim211). Note how sim_mutations also calls ts.tables.

I don't have the bandwidth right now to make a proper test... But it seems this won't matter after the changes to tskit's memory usage goes in, right @benjeffery?

The text was updated successfully, but these errors were encountered:

jeromekelleher · 2021-12-10T09:29:51Z

There shouldn't be any interaction between ts.tables and garbage collection @mufernando --- it's just making a copy now where we want it to return a view.

jeromekelleher · 2021-12-10T09:31:33Z

Here's an example:

import msprime

ts = msprime.sim_ancestry(10**4, sequence_length=1000, random_seed=1234)

for j in range(500):
    mts = msprime.sim_mutations(ts, rate=1, random_seed=j + 1)

Then,

mprof run sim_mutations_mem.py
mprof plot -o mutations-mem.png

gives

Looks pretty solid to me?

benjeffery · 2021-12-10T11:43:56Z

Are you getting the same result @mufernando? Would be interesting of your system is accumulating for this example.

mufernando · 2021-12-10T17:55:55Z

I ran the same example (with a msprime tree sequence) and didn't run into issues.

Then I ran a minimal example with three sim_mutation calls to one of my tree sequences (from SLiM, so with a ton of metadata everywhere). You can see the memory usage only grows over time.

Lastly, I added some manual garbage collection in between sim_mutation calls and that seems to help quite a bit.

With some number of objects being accumulated in between calls:

Before: collected 0 objects.
After first call: collected 84 objects.
After second call: collected 84 objects.
After third call: collected 84 objects.

This was my script:

import tskit
import msprime
import gc

ts = tskit.load("../../output/E082AD5JLBXZ4U6/great-apes_E082AD5JLBXZ4U6_rep0.trees")

collected = gc.collect()
print("Before: collected", collected, "objects.", flush=True)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1, keep=False)
collected = gc.collect()
print("After first call: collected", collected, "objects.", flush=True)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=2, keep=False)
collected = gc.collect()
print("After second call: collected", collected, "objects.", flush=True)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=2, keep=False)
collected = gc.collect()
print("After third call: collected", collected, "objects.", flush=True)

and I can share the tree sequence, but it's 250M.

petrelharp · 2021-12-10T18:10:59Z

ping @molpopgen?

molpopgen · 2021-12-10T19:14:30Z

Well, let me try to break a tie and see if I make a plot that looks more like @jeromekelleher or more like @mufernando.

jeromekelleher · 2021-12-10T19:16:30Z

Very nice, thanks @mufernando

molpopgen · 2021-12-10T19:36:22Z

import tskit
import msprime

ts = tskit.load("treefile_even_larger_effect_equilibrium_higmu.trees")

for j in range(5):
    mts = msprime.sim_mutations(ts, rate=1e-9, random_seed=j + 1)

Tie goes to @mufernando, I think? Did this with a file I'm working with at the moment.

molpopgen · 2021-12-10T19:48:52Z

For completeness, I repeated my example w/gc.collect() after each bout of mutating.

andrewkern · 2021-12-10T19:52:05Z

is this call in sim_mutations the issue?

msprime/msprime/mutations.py

Line 1369 in c034a5d

lwt.fromdict(tables.asdict())

molpopgen · 2021-12-11T03:10:47Z

The main difference between @jeromekelleher's example and the others seems to be the presence of metadata in the latter, which must be a clue?

jeromekelleher · 2021-12-11T06:23:29Z

Just looking at your plot again @mufernando - is that really 20s to load a 250mb file? Damn that's slow!

jeromekelleher · 2021-12-11T06:32:39Z

I agree with Kevin, it's certainly pointing towards the presence of metadata. I guess making a smaller reproducible example is the next step.

molpopgen · 2021-12-11T08:12:53Z

Just looking at your plot again @mufernando - is that really 20s to load a 250mb file? Damn that's slow!

Hmm. Yeah. In mine, it is 14 sec for 1.5 gigs.

jeromekelleher · 2021-12-11T20:00:57Z

Well, adding some top-level metadata doesn't do anything. This example has a flat memory profile:

ts = msprime.sim_ancestry(10**4, sequence_length=1000, random_seed=1234)
tables = ts.dump_tables()
tables.metadata_schema = tskit.MetadataSchema.permissive_json()
tables.metadata = {j: list(range(100)) for j in range(100)}
ts = tables.tree_sequence()

for j in range(500):
    mts = msprime.sim_mutations(ts, rate=1, random_seed=j + 1)

molpopgen · 2021-12-11T20:58:36Z

These experiments also gave a flat profile:

import msprime
import numpy as np
import tskit

ts = msprime.sim_ancestry(10 ** 4, sequence_length=1000, random_seed=1234)
ts = msprime.sim_mutations(ts, rate=1, random_seed=1)
tables = ts.dump_tables()

# This gave a flat profile
# tables.individuals.clear()
# ind = np.where(tables.nodes.individual != tskit.NULL)[0]
#
# tables.individuals.metadata_schema = tskit.MetadataSchema.permissive_json()
# for i, j in zip(ind, ind[1:]):
#    tables.individuals.add_row(metadata={"nodes": [int(i), int(j)]})

muts = tskit.MutationTable()
muts.metadata_schema = tskit.MetadataSchema.permissive_json()
for m in tables.mutations:
    muts.add_row(
        site=m.site,
        node=m.node,
        derived_state=m.derived_state,
        parent=m.parent,
        time=m.time,
        metadata={"x": [1, 2, 3, 4]},
    )

tables.mutations.metadata_schema = muts.metadata_schema
tables.mutations.set_columns(
    site=muts.site,
    node=muts.node,
    derived_state=muts.derived_state,
    derived_state_offset=muts.derived_state_offset,
    parent=muts.parent,
    time=muts.time,
    metadata=muts.metadata,
    metadata_offset=muts.metadata_offset,
)

# tables.metadata_schema = tskit.MetadataSchema.permissive_json()
# tables.metadata = {j: list(range(100)) for j in range(100)}
ts = tables.tree_sequence()

for j in range(500):
    mts = msprime.sim_mutations(ts, rate=1, random_seed=j + 1)

molpopgen · 2021-12-11T23:22:17Z

Note to self to try and use the methods outlined here to try to find reference cycles in the examples leading to growing memory use.

molpopgen · 2021-12-13T23:39:23Z

After changing my script to what is shown below, I got an absolutely massive log of stuff. The log is also large if I skip the mutation step. However, the difference between the 2 is lots of variables called ll_table, many copies of various mutation, individual, etc., tables. When I modified the script to gc.collect() right after load and then run:

mprof run mutatification.py |grep -c MutationTable

I got a count of 30. That seems way too many?

The script:

import gc
import sys
import tskit
import msprime

gc.set_debug(gc.DEBUG_SAVEALL)

ts = tskit.load("treefile_even_larger_effect_equilibrium_higmu.trees")

for j in range(5):
    mts = msprime.sim_mutations(ts, rate=1e-9, random_seed=j + 1)

gc.collect()
for i in gc.garbage:
    print(i)

molpopgen · 2021-12-13T23:44:04Z

This version gives a count of 6 mutation tables:

import gc
import sys
import tskit
import msprime

gc.set_debug(gc.DEBUG_SAVEALL)

ts = tskit.load("treefile_even_larger_effect_equilibrium_higmu.trees")

gc.collect()

for j in range(1):
    mts = msprime.sim_mutations(ts, rate=1e-9, random_seed=j + 1)

del ts
del mts

gc.collect()
for i in gc.garbage:
    print(i)

jeromekelleher · 2021-12-14T08:06:22Z

Hmmmmm....

molpopgen · 2021-12-14T12:40:58Z

Yeah, my intuition is no good on this. I think this is with tskit 0.3.7 installed, FWIW.

jeromekelleher · 2021-12-14T13:20:49Z

Thanks, this is very helpful. I'll dig in when I get a chance. Identifying if we have a refcount cycle will be very helpful and motivating in the read-only tables work: tskit-dev/tskit#2057

molpopgen · 2021-12-14T14:19:55Z

I haven't had luck finding a runtime analysis tool that'll print out reference cycles. Do you know of one?

jeromekelleher · 2021-12-14T15:36:17Z

No but I figure if gc finds things in simple cases like this that's evidence they exist?

molpopgen · 2021-12-14T15:53:36Z

I believe that's correct. But I want a back trace analog here. I want to see where they're coming from. But I'd also like a unicorn I guess.

jeromekelleher · 2024-07-17T12:42:23Z

I think this was resolved upstream in tskit

jeromekelleher added the Performance label Apr 5, 2022

jeromekelleher added this to the 1.3.0 milestone Apr 5, 2022

molpopgen mentioned this issue Jul 21, 2022

Referring to tables in TableCollection leaks memory tskit-dev/tskit#2428

Closed

jeromekelleher closed this as completed Jul 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sim_mutations and garbage collection #1946

sim_mutations and garbage collection #1946

mufernando commented Dec 10, 2021

jeromekelleher commented Dec 10, 2021

jeromekelleher commented Dec 10, 2021

benjeffery commented Dec 10, 2021

mufernando commented Dec 10, 2021

petrelharp commented Dec 10, 2021

molpopgen commented Dec 10, 2021

jeromekelleher commented Dec 10, 2021

molpopgen commented Dec 10, 2021

molpopgen commented Dec 10, 2021

andrewkern commented Dec 10, 2021

molpopgen commented Dec 11, 2021

jeromekelleher commented Dec 11, 2021

jeromekelleher commented Dec 11, 2021

molpopgen commented Dec 11, 2021

jeromekelleher commented Dec 11, 2021

molpopgen commented Dec 11, 2021

molpopgen commented Dec 11, 2021 •

edited

Loading

molpopgen commented Dec 13, 2021

molpopgen commented Dec 13, 2021 •

edited

Loading

jeromekelleher commented Dec 14, 2021

molpopgen commented Dec 14, 2021

jeromekelleher commented Dec 14, 2021

molpopgen commented Dec 14, 2021 •

edited

Loading

jeromekelleher commented Dec 14, 2021

molpopgen commented Dec 14, 2021

jeromekelleher commented Jul 17, 2024

sim_mutations and garbage collection #1946

sim_mutations and garbage collection #1946

Comments

mufernando commented Dec 10, 2021

jeromekelleher commented Dec 10, 2021

jeromekelleher commented Dec 10, 2021

benjeffery commented Dec 10, 2021

mufernando commented Dec 10, 2021

petrelharp commented Dec 10, 2021

molpopgen commented Dec 10, 2021

jeromekelleher commented Dec 10, 2021

molpopgen commented Dec 10, 2021

molpopgen commented Dec 10, 2021

andrewkern commented Dec 10, 2021

molpopgen commented Dec 11, 2021

jeromekelleher commented Dec 11, 2021

jeromekelleher commented Dec 11, 2021

molpopgen commented Dec 11, 2021

jeromekelleher commented Dec 11, 2021

molpopgen commented Dec 11, 2021

molpopgen commented Dec 11, 2021 • edited Loading

molpopgen commented Dec 13, 2021

molpopgen commented Dec 13, 2021 • edited Loading

jeromekelleher commented Dec 14, 2021

molpopgen commented Dec 14, 2021

jeromekelleher commented Dec 14, 2021

molpopgen commented Dec 14, 2021 • edited Loading

jeromekelleher commented Dec 14, 2021

molpopgen commented Dec 14, 2021

jeromekelleher commented Jul 17, 2024

molpopgen commented Dec 11, 2021 •

edited

Loading

molpopgen commented Dec 13, 2021 •

edited

Loading

molpopgen commented Dec 14, 2021 •

edited

Loading