accommodate recurrent mutations in branch_stats algorithm #15
With the most recent merge, we also have to consider back mutations, and eventually > 2 alleles. I'm not sure what the best approach is, but it is something we need to consider.
Related question: let X and Y be two independent, randomly chosen leaves of a tree, with P(X=Y)=0. Let … In the infinite sites model this is just …
Good question @petrelharp, I don't know the answer to that one. I expect in general that many things will be a lot harder when we don't have the nice 1 site == 1 mutation relationship. However, we can test this on a site-by-site basis (each site has a reference to the list of mutations), so that a simple efficient algorithm can be used if the site has a single mutation, and a less efficient algorithm otherwise. I expect most sites will have this single mutation property. I haven't really worked through the consequences of having multiple mutations per site yet; I'm just trying to represent the data that I have at the moment. LD and pi calculations will error out at the moment if called on tree sequences with > 1 mutation per site. One thing that I think will help is that I'm currently insisting that mutations must be listed in non-increasing time order of nodes within a site, so that we can apply the effects of mutations sequentially and be sure that we get the right state at the leaves. Perhaps if we insist that mutations are listed in pre-order instead (i.e., in the order they would be visited in a top-down tree traversal), we would be able to compute things like the expected Hamming distance efficiently in the general case.
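The site-by-site dispatch described above can be sketched roughly like this (a toy illustration with made-up record types and stub statistics, not msprime's actual classes):

```python
from collections import namedtuple

# Hypothetical, simplified records for illustration.
Mutation = namedtuple("Mutation", ["node", "derived_state"])
Site = namedtuple("Site", ["position", "ancestral_state", "mutations"])

def single_mutation_stat(site):
    # Fast path: one mutation means a clean biallelic split, so the
    # simple infinite-sites algorithm applies directly.
    return ("fast", site.mutations[0].derived_state)

def general_stat(site):
    # Slow path: recurrent/back mutations must be resolved explicitly.
    return ("general", len(site.mutations))

def process_site(site):
    """Dispatch on the number of mutations recorded at the site."""
    if len(site.mutations) == 1:
        return single_mutation_stat(site)
    return general_stat(site)

biallelic = Site(0.5, "A", [Mutation(4, "T")])
messy = Site(0.7, "A", [Mutation(4, "T"), Mutation(1, "A")])
print(process_site(biallelic))  # → ('fast', 'T')
print(process_site(messy))      # → ('general', 2)
```

Since most sites are expected to have the single-mutation property, the slow path should be hit rarely.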
> I expect in general that many things will be a lot harder when we don't have the nice 1 site == 1 mutation relationship. However, we can test this on a site-by-site basis (each site has a reference to the list of mutations), so that a simple efficient algorithm can be used if the site has a single mutation, and a less efficient algorithm otherwise. I expect most sites will have this single mutation property.

That's what I'm thinking, too, if we don't come up with something better. To be clear: you say 'we can test this' - are you planning to be able to easily distinguish mutations that have occurred more than once?
Well, it's easy to check if a site has more than one mutation; after that, I'm not sure what we can do. The new schema for sites/mutations looks like this: …

So, every site has a (unique) position and an ancestral state. In principle this can be any string, but in practice must be a single character. Every mutation happens at a given site, is over a single node (note that this is different to the previous model), and has a derived state. Mutations within a site must be listed in reverse time order (i.e., down the tree) so that we can apply their effects sequentially to the leaves and be sure to get the right state. (As I mentioned above, in order of tree traversal might be more useful.) In this case, I'm imagining two recurrent mutations over leaves 0 and 1 at site 0, and a mutation over internal node 3 with a back mutation at leaf 0. In principle we can now have alternating mutations happening all the way down the tree, so we'll need to reason a bit more carefully about things like allele frequencies. In the biallelic case I don't think it's too awful (I'm sure there will be a way to efficiently know the number of 1s at the leaves), but in the general ACTG case I'm not sure what'll happen. Does the schema make sense to you? Nothing is set in stone yet, so if you think it'll cause problems please do shout out!
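The schema itself isn't shown above, so here is a rough reconstruction of the shape being described (plain namedtuples for illustration, not the actual msprime table definitions):

```python
from collections import namedtuple

# Every site: a unique position plus an ancestral state (a single character in practice).
Site = namedtuple("Site", ["position", "ancestral_state"])

# Every mutation: the site it belongs to, the single node it sits over,
# and a derived state.
Mutation = namedtuple("Mutation", ["site", "node", "derived_state"])

# The example described: recurrent mutations over leaves 0 and 1 at one site,
# and a mutation over internal node 3 with a back mutation at leaf 0 at another.
sites = [Site(position=0.0, ancestral_state="0"),
         Site(position=1.0, ancestral_state="0")]
mutations = [
    Mutation(site=0, node=0, derived_state="1"),  # recurrent mutation at leaf 0
    Mutation(site=0, node=1, derived_state="1"),  # recurrent mutation at leaf 1
    Mutation(site=1, node=3, derived_state="1"),  # mutation over internal node 3
    Mutation(site=1, node=0, derived_state="0"),  # back mutation at leaf 0
]
```

Note the back mutation's derived state equals the site's ancestral state, which is exactly the case that makes allele-frequency counting trickier.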
Observation: branches are equivalent to splits (i.e., bipartitions), which are equivalent to biallelic sites. Currently, the statistic is defined by …

This definition applies perfectly fine to possibly recurrent mutations - the only issue is that we have to do a bit more work to determine what the … What about sites with more than one allele? There is not currently a consensus about how to compute single-site statistics like divergence using e.g. triallelic sites, and there's no obvious (simple) best answer, so this definition is as good as any. Concretely, let's take the definition above after replacing "each mutation" by "each allele". So, at sites with … This is slightly different than before, because with biallelic sites if we know the …

Any statistic that doesn't require knowing the ancestral state should be invariant under changing which allele is 'ancestral', so I think (but haven't checked) that translating previous … So, my proposal is to replace the definition of the statistic so that for branch lengths both …

How's that sound?
This sounds good @petrelharp. I don't really know how to deal with multiple alleles properly. In general, just knowing what the number of alleles actually is seems tricky (imagine lots of flip-flopping mutations down along the tree), and it seems that in the worst case it's hard to do anything without a full tree traversal at each site. It really gets pretty nasty, and I have no immediate application for the ACTG alphabet, so I wonder whether it's better to comprehensively deal with the binary case (with arbitrary recurrent/back mutations) and worry about k alleles later? But I also worry that we'll have to undo a bunch of stuff and change APIs down the road if we don't tackle k alleles now, at least identifying what the problems are and knowing what the outline solution to them will be.
I don't think it's that hard, given the information we have at hand? Suppose that for a given site we have … Here is an algorithm to obtain genotypes at this site. For each …

As for computing statistics: the key is producing, for a given subset … In retrospect: I think that your schema will allow us these fairly efficient algorithms, so I'm good with it. I don't see how to improve it (especially as multiple mutations are rare).
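Parts of that comment are cut off, but the leaf-to-root genotype idea appears to be: for each sample, walk up from the leaf and stop at the first node carrying a mutation at this site; that mutation's derived state is the genotype, with the ancestral state as the fallback. A toy sketch of that reading (plain parent-array tree, made-up example):

```python
def genotype(parent, leaf, mutations, ancestral_state):
    """Walk leaf-to-root; the first mutation met on the path gives the genotype.

    parent: dict mapping node -> parent node (root maps to -1).
    mutations: dict mapping node -> derived state at this site.
    """
    u = leaf
    while u != -1:
        if u in mutations:
            return mutations[u]
        u = parent[u]
    return ancestral_state

# Toy tree: leaves 0, 1 under node 3; leaf 2 and node 3 under root 4.
parent = {0: 3, 1: 3, 2: 4, 3: 4, 4: -1}
# Mutation to "T" over node 3, with a back mutation to "A" over leaf 0.
muts = {3: "T", 0: "A"}
print([genotype(parent, leaf, muts, "A") for leaf in (0, 1, 2)])  # → ['A', 'T', 'A']
```

A nice property of this direction is that once a genotype is found it never changes; the walk stops at the nearest mutation rather than continuing to the root.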
Thanks for looking at this @petrelharp, it's a great help to get your input. It's interesting that your algorithm for generating genotypes goes from leaf-to-root; I would have thought the other way around was the way to do it. Consider the following tree: Suppose we have one site with ancestral state A, and three mutations: [(19, T), (13, C), (14, G)]. We want to compute the frequency of all alleles.
I've ordered the nodes at the site in pre-order here, as I think it makes it easier to reason about what's happening in the current subtree (i.e., here we fully finish up with the subtree below 19 before moving on to look at the mutation at 14). It seems to me that there is a bit of efficiency to be gained here, because we don't have to look at the entire list of nodes but only the last one when we're moving back up the tree to see if we're affecting any existing mutations. There is a slight trickiness when computing the frequency of the alleles when we have back mutations to the ancestral state, but it's a small detail. I think this will work for calculating frequencies within subsets as well. The cost should be O((k - 1) * height of tree) if we have k mutations at a site, since in the worst case we traverse up to the root k - 1 times to find ancestral mutations. This is probably about as efficient as it's going to get. What do you think, does the algorithm make sense to you? I haven't written it out explicitly here, but hopefully it's fairly clear.
I'm very happy to hear that!
Hm, interesting. I think the main difference isn't really the time direction, it's what order the mutations are stored in. A possible advantage of the leaf-to-root direction is that once you find a genotype, it doesn't change. If trees might have many mutations this could be good, as we wouldn't have to actually reach the root of the tree (only go up to the most recent mutation for each sample of interest). But: not really relevant for population data. Your point about the number of comparisons between mutations is probably better, though. However, I think you'd have to go a bit further than just the last node. For instance, suppose that the mutations on that tree were [(20, T), (16, C), (13, C), (14, G)]. We'd need to keep track of the current list of mutations on the way back to the root, and compare all the way up that. Call this list S.
Recording the mutations in pre-order would take a bit of work, and is almost recording the whole tree of relationships between mutations - maybe this is something to think about? Keeping things in time order is simpler, but I guess if we want to compute a large number of statistics, this precomputation might be worthwhile. It would be harder to record and harder to check, though. If we kept the whole tree of relationships between mutations we could combine both algorithms. What do you think? Let me know; I'll code up a version.
Yes, you're right. We need to keep a stack of the mutations on the nodes that we have traversed on the way down from the root. The algorithm you've outlined is the right approach, I think.
Well, it depends on how you come up with the mutations in the first place. If you are generating them on the tree as a Markov process, then pre-order is the order in which they are naturally generated, and putting them in time order is an extra sorting step on top of this. I'd imagine any method for placing mutations on the tree from real data would have a similar form, so I don't think storing them in pre-order costs us anything. Checking is another story, though, and I agree this would be expensive. Not terribly expensive I think, but certainly worse than checking for time sorting. What turned me off time-sorting the mutations, though, is that it's actually much stricter than we need and can lead to unexpected problems. For example, when placing the mutations on trees inferred from data, it turned out that I had to put an extra sorting step at the end even though the mutations were in pre-order. Similar ugliness happens during …
Interesting! Can you explain a bit more what you are thinking here?
To make this a bit more concrete, I knocked together a quick Jukes-Cantor mutation generator in Python:

```python
import msprime
import numpy as np


def generate_site_mutations(tree, position, mu, site_table, mutation_table):
    """
    Generates mutations for the site at the specified position on the specified
    tree. Mutations happen at rate mu along each branch. The site and mutation
    information are recorded in the specified tables.
    """
    assert tree.interval[0] <= position < tree.interval[1]
    states = {"A", "C", "G", "T"}
    state = np.random.choice(list(states))
    site_table.add_row(position, state)
    site = site_table.num_rows - 1
    stack = [(tree.root, state)]
    while len(stack) != 0:
        u, state = stack.pop()
        if u != tree.root:
            branch_length = tree.branch_length(u)
            num_mutations = np.random.poisson(mu * branch_length)
            new_state = state
            for _ in range(num_mutations):
                new_state = np.random.choice(list(states - set(new_state)))
            if state != new_state:
                mutation_table.add_row(site, u, new_state)
                state = new_state
        stack.extend(reversed([(v, state) for v in tree.children(u)]))


def jukes_cantor(ts, num_sites, mu):
    """
    Returns a copy of the specified tree sequence with Jukes-Cantor mutations
    applied at the specified rate at the specified number of sites. Site
    positions are chosen uniformly.
    """
    positions = np.random.uniform(0, ts.sequence_length, num_sites)
    positions.sort()
    sites = msprime.SiteTable(num_sites)
    mutations = msprime.MutationTable(num_sites)
    trees = ts.trees()
    t = next(trees)
    for position in positions:
        while position >= t.interval[1]:
            t = next(trees)
        generate_site_mutations(t, position, mu, sites, mutations)
    tables = ts.dump_tables()
    new_ts = msprime.load_tables(
        nodes=tables.nodes, edgesets=tables.edgesets,
        sites=sites, mutations=mutations)
    return new_ts


def main():
    np.random.seed(2)
    mu = 1
    ts = msprime.simulate(10, recombination_rate=0, random_seed=1)
    new_ts = jukes_cantor(ts, 3, mu)
    print("Sites:")
    for site in new_ts.sites():
        print("site :", site.position, site.ancestral_state)
        for mutation in site.mutations:
            print("\t", mutation.node, mutation.derived_state)
    print("Haplotypes:")
    for h in new_ts.haplotypes():
        print(h)


if __name__ == "__main__":
    main()
```

Running this, I get …
These mutations are in pre-order and stored as they are produced. There is no post-processing needed. I guess this is a very naive algorithm (involving a full tree traversal for every site), but I'm not sure you're going to do much better if there are lots of mutations. What do you think @petrelharp?
> Well, it depends on how you come up with the mutations in the first place. If you are generating them on the tree as a Markov process, then pre-order is the order in which they are naturally generated and putting them in time order is an extra sorting step on top of this. I'd imagine any method for placing mutations on the tree from real data would have a similar form, so I don't think storing them in pre-order costs us anything.

Ah, ok! That makes sense then.

> If we kept the whole tree of relationships between mutations we could combine both algorithms.

> Interesting! Can you explain a bit more what you are thinking here?

In the mutation table we could store an additional variable, 'parent mutation', which gives the index of the mutation that this mutation occurred on the background of. Then we wouldn't need to do any additional traversing of the full tree to figure out how mutations are nested. If the mutations are being generated with this information handy, it seems worthwhile to store - it'd reduce the complexity of the code that uses the mutations.
Aha! Very cunning. Yes, this is the way to do it. Let me have a think about it.
Here is a simple implementation of the allele frequency algorithm when we have mutations sorted in pre-order:

```python
import collections


def allele_frequency(tree, site):
    """
    Returns the allele frequencies at the specified site.
    """
    f = collections.Counter()
    f[site.ancestral_state] = tree.sample_size
    visited_mutations = {tree.root: site.ancestral_state}
    for mutation in site.mutations:
        num_leaves = tree.num_leaves(mutation.node)
        f[mutation.derived_state] += num_leaves
        u = mutation.node
        while u != -1 and u not in visited_mutations:
            u = tree.parent(u)
        if u != -1:
            f[visited_mutations[u]] -= num_leaves
        visited_mutations[mutation.node] = mutation.derived_state
    return {k: v for k, v in f.items() if v > 0}


def allele_frequencies(ts):
    """
    Returns a list of allele frequency dictionaries, one per site.
    """
    freqs = []
    for tree in ts.trees():
        for site in tree.sites():
            freqs.append(allele_frequency(tree, site))
    return freqs
```

This requires a traversal back to root for each mutation (in the worst case), which is O(log n), and an O(log #mutations) lookup to see if we have visited a particular mutation yet. This isn't great, but isn't terrible either. The question is, how much cheaper is this if we store the mutation tree? Is it worth the cost of extra storage and complexity?
Well, storing the tree of mutations would reduce those log(n) + log(#muts) steps to log(#muts). It's actually log_2(n), no? So it would be a factor of order 10?

Not sure how to balance this against storage and complexity. The increase in storage is one additional thing per derived mutation - so, substantial; that's like doubling the memory footprint of mutations, no?

Your view? I could implement this assuming storage of the mutational parent.
Yeah, that's pretty much my take. Although, if we could get rid of the dictionary lookup entirely, then that would be ideal.
We're currently storing (site, node, derived_state) for each mutation, so having (site, node, parent, derived_state) wouldn't be a significant extra cost. For me, it's all about the algorithm. If the algorithm is sufficiently clean and simple for getting allele frequencies, then it's worth the extra complexity of tracking the parent mutation for every mutation. If the algorithm looks roughly the same except we avoid the traversal to root, then it's not quite so clear to me that it's worth the effort. This is, after all, a relatively rare case of multiple mutations and we don't want to prematurely optimise.
Would you like to have a go at the allele frequency algorithm, given the mutational parent information? It should be easy enough to augment the code above to just pass around an extra array which is used in the allele frequency calculation.
Adding this to the Jukes-Cantor example code above will do it (swap in for the `allele_frequency` above):

```python
def get_parent_mutation(node, tree, site):
    """
    Returns the index of the mutation strictly above node (or -1 if there is none).
    """
    u = tree.parent(node)
    mutation_nodes = [x.node for x in site.mutations]
    while u != -1 and u not in mutation_nodes:
        u = tree.parent(u)
    if u == -1:
        return -1
    else:
        return mutation_nodes.index(u)


def allele_frequency(tree, site):
    """
    Returns the allele frequencies at the specified site.
    """
    f = collections.Counter()
    f[site.ancestral_state] = tree.sample_size
    for mutation in site.mutations:
        num_leaves = tree.num_leaves(mutation.node)
        f[mutation.derived_state] += num_leaves
        parent_mutation = get_parent_mutation(mutation.node, tree, site)
        if parent_mutation == -1:
            parent_state = site.ancestral_state
        else:
            parent_state = site.mutations[parent_mutation].derived_state
        f[parent_state] -= num_leaves
    return {k: v for k, v in f.items() if v > 0}


# and adding to main():
print("Frequency spectrum:")
afs = allele_frequencies(new_ts)
for site in afs:
    print("site:")
    for x in site:
        print("\t", x, site[x])
```

Here … Note that we don't have to traverse the tree in any particular order, or keep track of which nodes we've visited already - just add leaves below the node and subtract those from the parent.
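If the mutation table stored the parent mutation directly (the 'parent mutation' column proposed above), the traversal in `get_parent_mutation` could be skipped entirely. Here is a self-contained sketch of that version, with a hypothetical `parent` field holding the index of the mutation this one occurred on the background of (-1 for the ancestral background):

```python
import collections
from collections import namedtuple

# Hypothetical simplified records; `parent` indexes into site.mutations.
Mutation = namedtuple("Mutation", ["node", "derived_state", "parent"])
Site = namedtuple("Site", ["ancestral_state", "mutations"])

def allele_frequency(site, num_leaves_below, sample_size):
    """Allele frequencies at a site, using stored parent-mutation indexes."""
    f = collections.Counter()
    f[site.ancestral_state] = sample_size
    for mut in site.mutations:
        n = num_leaves_below[mut.node]
        # Add the leaves below this mutation to its derived state...
        f[mut.derived_state] += n
        # ...and subtract them from whatever state they were previously counted as.
        if mut.parent == -1:
            f[site.ancestral_state] -= n
        else:
            f[site.mutations[mut.parent].derived_state] -= n
    return {k: v for k, v in f.items() if v > 0}

# Toy example: 4 samples; mutation to "T" over a node with 2 leaves below,
# and a nested back mutation to "A" over one leaf beneath it.
site = Site("A", [Mutation(node=3, derived_state="T", parent=-1),
                  Mutation(node=0, derived_state="A", parent=0)])
print(allele_frequency(site, {3: 2, 0: 1}, 4))  # → {'A': 3, 'T': 1}
```

Each mutation then costs O(1) beyond the leaf-count lookup, which is the saving the stored parent would buy.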
Yes, this algorithm is nice, isn't it? I'll have another look at this soon when I get a chance and see what's involved in making the updates to store the parent mutation.
Whoops - my leaf-to-root algorithm above is not quite right, because as written it only subtracts descendants of any direct descendant mutations, not all descendant mutations. Well, that's true if you stop at the previous mutation; if instead you moved all the way up to the root it'd work. Moving root-down fixes this, because we haven't removed offspring of descendant nodes yet. There's an easy way to do this - let …
A note about the above - at first we thought about generalizing to weight functions that took tuples, so if a particular site had three alleles with counts … This has the advantage of making the arguments to …
OK, I've moved the TreeStat machinery back to working with a list of integers rather than a list of tuples (easy, because this is basically how it worked already). |
Now is also the time to finalize naming, btw. Currently, we have …
Since both are on trees, a better name for the first might be …
There's a normalization problem: a TreeStat and a SiteStat with the same weighting function are almost the same thing. But with TreeStat, we sum … This is analogous to mean pairwise diversity being equal to twice the mutation rate times mean TMRCA. But those are established things we can't change; here we are in the position to define the "f4" statistic for branch lengths, and we should define it to be "the thing with the same weighting function". What to do about this? Options: …

For consistency with others, either (1) or (2) is right, but it doesn't really matter - when we implement e.g. … So, we should go with the conceptually more natural thing. I guess I vote for (1), because in computing TreeStats we are taking advantage of symmetry and only evaluating the function on one part of the bipartition. So, I'm going with (1) but can readjust if there's other opinions. Just to be clear, we are keeping the same interpretation of e.g. "branch length f4" in terms of branch lengths, as is clear from comparison to the use of this function in tests. This just means we need to divide the weighting function we use to get it by two (or actually, divide all output by two at the end).
Golly gee, it seems to be working! Needs a bit more testing, but getting there... |
FYI: I've unified interfaces here. Now there are two classes: …
Sounds great, thanks @petrelharp! Re naming things: … For the normalisation thing, I'm happy to follow your lead with either (1) or (2).
I have changed the name from …
There's a conceptual issue here if there can be more than two alleles. I ran into this in #290: the definition I like of …

I noticed this because the latter two things are not equivalent if there can be more than two alleles. (Good thing we implemented Jukes-Cantor for testing!) The way I proposed implementing things above was to have statistics additive across alleles, so that the weight associated with a segregating site is …

where the sum is over alleles … But then if a, b, and c all have distinct alleles, this gives weight 1/2 to the site even though there's a mutation occurring on the … But now consider the (now, four-taxon) statistic that measures branches with … But on the other hand, the correspondence between branch lengths and numbers of mutations only holds to the extent that sites have a single mutation on them; so it is not sensible to worry about such sites too much. Mulling this over a bit more.

At issue is the desire to use the same weighting function for both types of statistic. If we change the weight to ask if the allele is inherited by a but not b or c, this is: …

To "depolarize" this we currently have …

This latter has the property that it is invariant under exchanging …
This all sounds sensible @petrelharp, but I'm afraid it's gone over my head to a large degree. I think we need to have a call sometime so you can explain this stuff to me!
OK, I know what to do. tl;dr is: remove the factor of 2, sum over both sides of a branch for BranchLength stats, and only consider stats that are stated as properties of bipartitions. We should definitely talk sometime; it'd help to figure out how to write this down concisely. Let's consider only statistics having the following properties: …

The second thing means that the statistic is formed by summing over the bipartitions induced by each allele, and is motivated by the fact that in an infinite sites model you can't have two mutations on the same tree, so if every tree were independent you couldn't estimate higher moments of the branch lengths. Also, it simplifies things substantially, and I would be suspicious of any popgen inference depending strongly on seeing more than two alleles per site. Consider …

The first one is not admissible since it requires knowing the allelic states of all three samples - consider the case where a=G, b=C, and c=T (that's a nucleotide C, not a set C). You cannot find this latter probability by summing a single function across all bipartitions (although I haven't yet found a concise statement of what I mean there, and a proof). This means using the weighting function …

and then, to depolarize it for BranchLength stats, adding both … Another note about what's going on: if … Furthermore, we are saying that a Site stat and a BranchLength stat are "the same" (better term?) if …

and we are defining things so that (1) and (2) are equivalent conditions.
Sounds good to me @petrelharp, but I've really only got a superficial grasp of what's going on here. I'm looking forward to rolling my sleeves up and properly getting to grips with this once I've cleared off a few things from the TODO mountain!
This is done now with general branch stats code. |
Thoughts:

There's only such an easy correspondence between branch length and # of mutations if we assume all mutations are distinguishable. So to exactly match the branch length statistics, we'd ignore the fact that there may be double mutations. But to match what we see in the sequence (like we actually want to do): take divergence: a mutation that occurs on branches leading to both samples should be counted zero times, not twice. But, if the mutations returned by branch_mutations know where else on the tree they occur, then this is doable with a bit more bookkeeping. For instance: first update the X's for each edge; then iterate over the mutations, evaluating (and caching) f(\sum_{branches} X) for each mutation to get the weight for that mutation. Or maybe just use the original algorithm but then go and look individually at all the recurrent mutations to adjust.
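A rough sketch of that caching bookkeeping, under one reading of the comment above (all names here are made up): accumulate the summed sample weights X below each edge first, then evaluate f once per node and cache it, so every mutation gets its weight from a single lookup.

```python
def mutation_weights(mutations, x_below, f):
    """Weight each mutation by f applied to the summed sample weights
    below its node, caching f's value per node.

    mutations: iterable of (site, node) pairs.
    x_below: dict mapping node -> summed sample weights below that node.
    f: weighting function on those summed weights.
    """
    cache = {}
    weights = {}
    for site, node in mutations:
        if node not in cache:
            cache[node] = f(x_below[node])
        weights[(site, node)] = cache[node]
    return weights

# Toy usage: two samples of interest; a divergence-style weight x * (2 - x)
# gives weight 0 to a mutation ancestral to both samples, as desired.
xb = {3: 1, 5: 2}
w = mutation_weights([(0, 3), (0, 5), (1, 3)], xb, lambda x: x * (2 - x))
print(w)  # → {(0, 3): 1, (0, 5): 0, (1, 3): 1}
```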