Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filter_nodes for simplify #2619

Merged
merged 2 commits into from
Jan 10, 2023
Merged

Conversation

hyanwong
Copy link
Member

Description

A follow up to #2606 with tests added. Currently intentionally failing until we decide whether to allow the passed-in samples to be anything other than ts.samples(). Will also need to implement this in the C version of simplify, and switch compare_lib=True

Fixes #2605.

PR Checklist:

  • Tests that fully cover new/changed functionality.
  • Documentation including tutorial content if appropriate.
  • Changelogs, if there are API changes.

@hyanwong hyanwong marked this pull request as draft October 30, 2022 22:03
@hyanwong hyanwong marked this pull request as draft October 30, 2022 22:03
@hyanwong hyanwong marked this pull request as draft October 30, 2022 22:03
@hyanwong hyanwong marked this pull request as draft October 30, 2022 22:03
@hyanwong hyanwong changed the title Simplify filter nodes filter_nodes for simplify Oct 30, 2022
@codecov
Copy link

codecov bot commented Oct 30, 2022

Codecov Report

❗ No coverage uploaded for pull request base (main@d4efac9). Click here to learn what that means.
The diff coverage is 90.81%.

❗ Current head 4f49c76 differs from pull request most recent head 7ea2b59. Consider uploading reports for the commit 7ea2b59 to get more accurate results

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #2619   +/-   ##
=======================================
  Coverage        ?   93.92%           
=======================================
  Files           ?       28           
  Lines           ?    27731           
  Branches        ?     1271           
=======================================
  Hits            ?    26046           
  Misses          ?     1648           
  Partials        ?       37           
Flag Coverage Δ
c-tests 92.26% <90.44%> (?)
lwt-tests 89.05% <ø> (?)
python-c-tests 72.29% <85.71%> (?)
python-tests 98.97% <100.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
python/tskit/trees.py 98.75% <ø> (ø)
c/tskit/tables.c 90.24% <90.44%> (ø)
python/_tskitmodule.c 92.73% <100.00%> (ø)
python/tskit/tables.py 98.96% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d4efac9...7ea2b59. Read the comment docs.

@hyanwong
Copy link
Member Author

I just realised that the easiest way to allow any nodes to be marked as samples is to switch the NODE_IS_SAMPLE flags before simplifying. Since we require the node order to remain the same, I think this is all we need to do.

@hyanwong
Copy link
Member Author

I just realised that the easiest way to allow any nodes to be marked as samples is to switch the NODE_IS_SAMPLE flags before simplifying. Since we require the node order to remain the same, I think this is all we need to do.

Actually, it's even easier than that. Recent commit included: I think this works fine:

sample = set(sample)
for node in ts.nodes():
    is_sample = node.id in sample
    self.record_node(node.id, is_sample=is_sample)
    if is_sample:
        self.add_ancestry(node.id, 0, self.sequence_length, node.id)

@hyanwong hyanwong force-pushed the simplify-filter-nodes branch 2 times, most recently from 9da35ab to ac20541 Compare October 31, 2022 11:23
@hyanwong
Copy link
Member Author

I think this now needs to be implemented in the C version of simplify(), right?

@hyanwong hyanwong force-pushed the simplify-filter-nodes branch 6 times, most recently from 0cb7cbc to b32c3bc Compare October 31, 2022 21:25
@hyanwong
Copy link
Member Author

hyanwong commented Oct 31, 2022

I have coded up the C equivalent, but there are 2 remaining issues:

  1. In merge_labeled_ancestors a new line that @jeromekelleher introduced says is_sample = input_id in self.samples # HACK (and I have coded up a C equivalent). The HACK comment worries me slightly.
  2. The default C call will have TSK_SIMPLIFY_FILTER_NODES set to 0, so the defaults will change to not filter nodes, which I think is causing the C tests to fail. This seems like a rather large change in default behaviour. I'm not sure what the best options are here.

@molpopgen
Copy link
Member

I am trying to figure out what this does as I think it collides with something that one may need for (one approach to) parallel simplification.

The comments in #2605 are concerning to me re: marking all nodes as samples. That would completely break the stuff that I am presenting in December by rendering it not possible. (Or not possible w/o another flag to simplify that suppresses the behavior described in #2605, but I'd consider that a serious design failure.)

Could someone provide a simple description of what this is doing -- I can't find one by by tracking the linked issues, etc..

@hyanwong
Copy link
Member Author

hyanwong commented Oct 31, 2022

The comments in #2605 are concerning to me re: marking all nodes as samples.

I don't think we mark all nodes as samples. The current implementation of simplify() takes a list of nodes, and turns all those nodes into samples, keeping their ancestry WRT the other nodes, which are marked as non-samples. By default, the list of passed-in nodes is simply all the existing samples in the TS, but you could provide a different list of nodes, in which case these become the samples instead, and any nodes in their genetic ancestry are kept but are flagged as non-sample nodes. As far as I know, simplify() has always worked like this.

Could someone provide a simple description of what this is doing -- I can't find one by by tracking the linked issues, etc..

Well, I think the key is here (code lightly edited for clarity)

if filter_nodes:
    for sample_id in passed_in_node_list:
        output_id = self.record_node(sample_id, is_sample=True)
        self.add_ancestry(sample_id, 0, self.sequence_length, output_id)
else:
    for node in ts.nodes():
        is_sample = node.id in passed_in_node_list
        self.record_node(node.id, is_sample=is_sample)
        if is_sample:
           self.add_ancestry(node.id, 0, self.sequence_length, node.id)

So normally (filter_nodes==True) we record the passed-in node list, treating all of these nodes as samples, and add the ancestry of those nodes into the TS. If we don't filter nodes, we record all the nodes in the normal order (meaning that the node IDs don't change), but only the ones passed in are marked as samples, and we retain the ancestry of those. That means a number of the non-sample nodes in the ts may end up not being referenced in the node table.

Does that make sense? Is my explanation too simplistic @molpopgen ? I feel you probably understand this much better than I do anyway.

@molpopgen
Copy link
Member

Yes, I think that this makes sense -- the work "all" in the comment elsewhere was ambiguous to me.

So, if filter nodes is False:

  • all nodes get recorded
  • only sample nodes have their ancestry preserved.

For this to make sense, all nodes that get recorded by have no referents after simplification should be marked as NULL in the output idmap? Is that the case?

@hyanwong
Copy link
Member Author

hyanwong commented Oct 31, 2022

So, if filter nodes is False:

  • all nodes get recorded
  • only sample nodes have their ancestry preserved.

Yes, with the proviso that the sample nodes are defined as the ones passed in to simplify (which by default is all the nodes flagged with tskit.NODE_IS_SAMPLE in the original tree sequence)

For this to make sense, all nodes that get recorded by have no referents after simplification should be marked as NULL in the output idmap? Is that the case?

The idmap doesn't tell you whether the node is used or not, only if the node Id has changed. Even in the case of nodes that have no referents after simplification, the nodes are retained in the table (although not referenced anywhere), and the idmap is simply the identify map. In fact, by definition, the output idmap when filter_nodes=False should be 0..num_nodes-1 and not contain any NULLs (because no nodes are actually removed).

@molpopgen
Copy link
Member

The idmap doesn't tell you whether the node is used or not, only if the node Id has changed.

This isn't quite right -- a node that "simplifies out" of the tables (is not longer part of the ancestry anywhere) has -1/null in the idmap. For what I envision, I require this to be true for this feature, too.

@molpopgen
Copy link
Member

Yes, with the proviso that the sample nodes are defined as the ones passed in to simplify (which by default is all the nodes flagged with tskit.NODE_IS_SAMPLE in the original tree sequence)

This is not what I read in the code that you wrote above -- what is "passed in node list"?

@molpopgen
Copy link
Member

In fact, by definition, the output idmap when filter_nodes=False should be 0..num_nodes-1 and not contain any NULLs (because no nodes are actually removed).

This is the sticking point, and breaks what I need this feature to do. I require a way to: keep the input and output node tables the same length but also to know which nodes are no longer used. The implementation of this that I have done takes this into account by entering NULL into idmap for the latter case.

@molpopgen
Copy link
Member

molpopgen commented Oct 31, 2022

My implementtion is here: https://github.com/molpopgen/tskit-rust/tree/parallel_experiment.

I cannot make it easy to see diffs b/c that would trigger a PR against tskit-rust, but you can look at the commit history for the branch and see the changes to simplification (in the C code)

@hyanwong
Copy link
Member Author

This is not what I read in the code that you wrote above -- what is "passed in node list"?

The user calls ts.simplify(samples=[10,11,12]). The nodes 10, 11, and 12 become the samples in the output tree sequence. If not give, the samples parameter default to ts.samples(), so that the samples in the input tree sequence remain samples in the output tree sequence (albeit with changed node IDs).

I require a way to: keep the input and output node tables the same length but also to know which nodes are no longer used.

I'm sure we could find a way to report this. We could revise the meaning of the idmap in this case, or (I guess) we could search through the edge table (but that's not very performant). @jeromekelleher might have a sensible suggestion here.

My previous hack would work for this, as a matter of fact: do a normal simplify (removing unreferenced nodes) to create a new tree sequence, look at the nodes with NULL in the map, then edit the original tree sequence by porting the edges over from the new tree sequence back into the old one, with a suitable remapping of parent and child IDs. Maybe that's too hack for you, though?

@hyanwong
Copy link
Member Author

I think it could be reasonable to return an id_map with NULL for unused nodes in the filter_nodes=False case, but I don't know if it's actually very easy to implement. I will ask @jeromekelleher .

@petrelharp
Copy link
Contributor

All this sounds great to me. Two suggestions:

  • how about TSK_SIMPLIFY_KEEP_NODES or TSK_SIMPLIFY_KEEP_ALL_NODES to match other flags
  • how about TSK_SIMPLIFY_NO_MARK_SAMPLES as more concise than NO_UPDATE_SAMPLE_FLAGS (and maybe more clear, because why is it "update"?)

@jeromekelleher
Copy link
Member

jeromekelleher commented Dec 12, 2022

I think it's better to keep this as NO_FILTER following the population etc tables because the KEEP flags (in retrospect) are more about keeping edges than nodes. We have:

TSK_SIMPLIFY_FILTER_SITES
TSK_SIMPLIFY_FILTER_POPULATIONS
TSK_SIMPLIFY_FILTER_INDIVIDUALS
TSK_SIMPLIFY_REDUCE_TO_SITE_TOPOLOGY
TSK_SIMPLIFY_KEEP_UNARY
TSK_SIMPLIFY_KEEP_INPUT_ROOTS
TSK_SIMPLIFY_KEEP_UNARY_IN_INDIVIDUALS

We keep the edges, and therefore the corresponding nodes are not filtered.

The trouble with TSK_SIMPLIFY_NO_MARK_SAMPLES is that we also unmark things that are not samples, so UPDATE is clearer I think.

@jeromekelleher
Copy link
Member

OK, this should implement the "don't touch the table pointers" semantics when we don't filter unreferenced nodes, sites, populations or individuals. I had to jiggle things around a bit to do it, but I think it was worth the refactoring. Hopefully it's a bit clearer how the copies are used now and it opens the door for future memory reductions as well.

I still need to document and add the NO_UPDATE_SAMPLE_FLAGS. I'm also hoping to add an example with multiple table collections using the same nodes and individuals, to see convince myself that this all works in practise but we'll see how that goes time-wise.

@jeromekelleher jeromekelleher marked this pull request as ready for review December 14, 2022 16:15
@jeromekelleher
Copy link
Member

I think this is ready for review and merge - I'll implement TSK_SIMPLIFY_NO_UPDATE_SAMPLE_FLAGS separately as this PR has ended up getting quite big.

Turns out making this memory safe required quite a bit of fiddling! I think the code it much better now though, so hopefully worth the effort.

Copy link
Member

@benjeffery benjeffery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - I only found a couple of typos

c/tskit/tables.h Outdated Show resolved Hide resolved
c/tskit/tables.h Outdated Show resolved Hide resolved
@jeromekelleher
Copy link
Member

Added a few minor things from review from @petrelharp and @benjeffery - ready to go I think!

@jeromekelleher
Copy link
Member

Opened #2676 for the CircleCI build failure (which looks non-trivial)

@benjeffery
Copy link
Member

git pushing here to trigger CI as for some reason circle CI won't let me click the button.

@jeromekelleher
Copy link
Member

@Mergifyio rebase

hyanwong and others added 2 commits January 10, 2023 11:41
Revert to old implementation

Simple tset

Update changelog and add C tests

Done?

Low level tests

More tests

Fix error-handling error in simplifier_init

We were clearing the input tables before checking sample errors

Fixup broken tests

Add test for mutations

Don't clear the node table when not filtering nodes

Refactor

Modernise simplify test

Fix provenance bug

Removed unused samples member

Implement filter-populations with no-touch semantics

updates

Refactor finalise references path

Finished no-touch semantics on the non-filter casese

Remove unused simplify_t struct member

Make simplify thread-safe in no-filter case

Update changelog
@mergify
Copy link
Contributor

mergify bot commented Jan 10, 2023

⚠️ This pull request got rebased on behalf of a random user of the organization.
This behavior will change on the 1st February 2023, Mergify will pick the author of the pull request instead.

To get the future behavior now, you can configure bot_account options (e.g.: bot_account: { author } or update_bot_account: { author }.

Or you can create a dedicated github account for squash and rebase operations, and use it in different bot_account options.

@mergify
Copy link
Contributor

mergify bot commented Jan 10, 2023

rebase

✅ Branch has been successfully rebased

@jeromekelleher jeromekelleher added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Jan 10, 2023
@mergify mergify bot merged commit 2b61dfd into tskit-dev:main Jan 10, 2023
@mergify mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Jan 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add filter_nodes argument to simplify
5 participants