
accommodate recurrent mutations in branch_stats algorithm #15

Closed · petrelharp opened this issue Jan 23, 2017 · 35 comments

@petrelharp (Contributor)

Thoughts:

There's only such an easy correspondence between branch length and number of mutations if we assume all mutations are distinguishable. So to exactly match the branch length statistics, we'd ignore the fact that there may be double mutations. But to match what we see in the sequence (like we actually want to do), take divergence, for example: a mutation that occurs on branches leading to both samples should be counted zero times, not twice. But if the mutations returned by branch_mutations know where else on the tree they occur, then this is doable with a bit more bookkeeping. For instance: first update the X's for each edge; then iterate over the mutations, evaluating (and caching) $f(\sum_{branches} X)$ for each mutation to get the weight for that mutation. Or maybe just use the original algorithm, but then go and look individually at all the recurrent mutations to adjust.

@jeromekelleher (Member)

With the most recent merge, we also have to consider back mutations, and eventually > 2 alleles. I'm not sure what the best approach is, but it is something we need to consider.

@petrelharp (Contributor, Author)

Related question: let $X$ and $Y$ be two independent, randomly chosen leaves of a tree, with $P(X=Y)=0$. Let $d(x,y)$ be the Hamming distance between the sequences at leaves $x$ and $y$, where the sequences have evolved on the tree and we know where/when/what all the mutations were. How do we efficiently calculate $E[d(X,Y)]$?

In the infinite sites model this is just $\sum_e P(e \in X \to Y)\, m(e)$, where $m(e)$ is the number of mutations on edge $e$, and $e \in X \to Y$ means that edge $e$ is on the path from $X$ to $Y$. But with back mutation, is there a way to calculate it without summing over all pairs of possible endpoints?
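For the infinite-sites expression, here's a minimal sketch, assuming the msprime-style sparse tree API used elsewhere in this thread (tree.root, tree.nodes(), tree.num_leaves(), tree.sample_size) and a hypothetical dict m from node to the number of mutations on the edge above it:

def expected_hamming_distance(tree, m):
    # E[d(X,Y)] = sum over edges of P(edge on path X -> Y) * m(edge), for a
    # uniformly chosen pair of distinct leaves X, Y. m is a hypothetical
    # dict: node -> number of mutations on the edge above that node.
    n = tree.sample_size
    total = 0.0
    for u in tree.nodes():
        if u != tree.root:
            k = tree.num_leaves(u)
            # the edge above u separates its k leaves from the other n - k
            p = 2 * k * (n - k) / (n * (n - 1))
            total += p * m.get(u, 0)
    return total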

@jeromekelleher (Member)

Good question @petrelharp, I don't know the answer to that one.

I expect in general that many things will be a lot harder when we don't have the nice 1 site == 1 mutation relationship. However, we can test this on a site-by-site basis (each site has a reference to the list of mutations), so that a simple efficient algorithm can be used if the site has a single mutation, and a less efficient algorithm otherwise. I expect most sites will have this single mutation property.

I haven't really worked through the consequences of having multiple mutations per site yet, I'm just trying to represent the data that I have at the moment. LD and pi calculations will error out at the moment if called on tree sequences with > 1 mutation per site.
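A minimal sketch of that site-by-site dispatch (the handler names here are hypothetical stand-ins):

def site_statistic(tree, site):
    # Fast path for the common single-mutation case; fall back to a general
    # handler when the site has recurrent or back mutations.
    if len(site.mutations) == 1:
        return single_mutation_stat(tree, site)  # hypothetical fast path
    return general_stat(tree, site)              # hypothetical general path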

One thing that I think will help is that I'm currently insisting that mutations must be listed in nonincreasing time order of nodes within a site, so that we can apply the effects of mutations sequentially and be sure that we get the right state at the leaves. Perhaps if we insist that mutations are listed in pre-order instead (i.e., in the order they would be visited in a top-down tree traversal), we would be able to compute things like the expected Hamming distance efficiently in the general case.

@petrelharp (Contributor, Author)

petrelharp commented Mar 16, 2017 via email

@jeromekelleher (Member)

jeromekelleher commented Mar 16, 2017

> To be clear: you say 'we can test this' - are you planning to be able to easily distinguish mutations that have occurred more than once?

Well, it's easy to check if a site has more than one mutation; after that, I'm not sure what we can do. The new schema for sites/mutations looks like this:

========
sites
========
position    ancestral_state
0.1         0
0.2         0

========
mutations
========
site    node    derived_state
0       0       1
0       1       1
1       3       1
1       0       0

So, every site has a (unique) position and an ancestral state. In principle this can be any string, but in practice must be a single character. Every mutation happens at a given site, is over a single node (note that this is different to the previous model), and has a derived state. Mutations within a site must be listed in reverse time order (i.e., down the tree) so that we can apply their effects sequentially to the leaves and be sure to get the right state. (As I mentioned above, in order of tree traversal might be more useful.)

In this case, I'm imagining two recurrent mutations over leaves 0 and 1 at site 0, and a mutation over internal node 3 with a back mutation at leaf 0.
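To make that concrete, here's a toy sketch in plain Python (not the msprime API) applying those example tables; the tree shape is assumed, with internal node 3 ancestral to samples 0 and 1. Mutations at a site are applied in listed order, so the later row can overwrite the earlier one, which is exactly what gives the back mutation at leaf 0 its effect:

sites = [(0.1, "0"), (0.2, "0")]  # (position, ancestral_state)
mutations = [(0, 0, "1"), (0, 1, "1"), (1, 3, "1"), (1, 0, "0")]  # (site, node, derived_state)

samples = [0, 1, 2]
# descendants[u]: the samples below node u in the assumed tree
descendants = {0: {0}, 1: {1}, 2: {2}, 3: {0, 1}}

for site_id, (position, ancestral) in enumerate(sites):
    state = {u: ancestral for u in samples}
    for s, node, derived in mutations:
        if s == site_id:
            for u in descendants[node]:
                state[u] = derived  # later mutations overwrite earlier ones
    print(position, "".join(state[u] for u in samples))
# 0.1 110  <- recurrent mutations over leaves 0 and 1
# 0.2 010  <- mutation over node 3, then the back mutation restores leaf 0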

In principle now, we can have alternating mutations happening all the way down the tree, so we'll need to reason a bit more carefully about things like allele frequencies. In the biallelic case I don't think it's too awful (I'm sure there will be a way to efficiently know what the number of 1s at the leaves is), but in the general ACTG case I'm not sure what'll happen.

Does the schema make sense to you? Nothing is set in stone yet, so if you think it'll cause problems please do shout out!

@petrelharp (Contributor, Author)

Observation: branches are equivalent to splits (i.e., bipartitions), which are equivalent to biallelic sites.

Currently, the statistic is defined by

  1. a list of sets of samples $A_1,...,A_n$, and
  2. a function $f()$ that takes a list of integers and returns a number;
    then each mutation (or unit of branch length) counts towards the statistic weighted by $f()$ of the vector of numbers of individuals in each set of samples inheriting from that mutation (or branch), $N_1,...,N_n$.

This definition applies perfectly fine to possibly recurrent mutations - the only issue is that we have to do a bit more work to determine what the $N$'s are for a particular mutation, since if it has occurred more than once then these aren't the same as the $N$'s for the branch it lies on.

What about sites with more than one allele? There is not currently a consensus about how to compute single-site statistics like divergence using e.g. triallelic sites, and there's no obvious (simple) best answer, so this definition is as good as any. Concretely, let's take the definition above after replacing "each mutation" by "each allele". So, at sites with $k$ alleles, we'd compute the vector of $N$'s for each of the $k$ alleles, apply $f()$ to each of these, and sum them.

This is slightly different from before, because with biallelic sites, if we know the $N$'s for a particular mutation, we also know them for the alternate allele, and we exploited this in defining particular $f()$'s. We can't just sum over derived mutations, either: define divergence to be "density of differing sites"; let $A_1$ and $A_2$ both contain exactly one sample, and let $f(x,y) = xor(x,y)$. This counts derived alleles inherited by $A_1$ or $A_2$ but not both. Biallelic sites are counted correctly, but triallelic sites where $A_1$ and $A_2$ each inherit different derived alleles would get counted twice, incorrectly. The solution is to let $f(x,y) = xor(x,y)/2$, and sum over all alleles, so that

  • at a site with ancestral state G, sampled sequences G, C, we'd get $N(G) = (1,0)$ and $N(C) = (0,1)$ so that $f(N(G)) + f(N(C)) = 1$
  • at a site with ancestral state G, sampled sequences T, C, we'd get $N(T) = (1,0)$ and $N(C) = (0,1)$ and $N(G) = (0,0)$, so that again $f(N(G)) + f(N(C)) + f(N(T))= 1$.
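A quick plain-Python check of those two bullet points, with $f(x,y) = xor(x,y)/2$ as just defined:

def f(x, y):
    return (x != y) / 2   # xor(x, y) / 2 for 0/1 counts

# ancestral G, sampled sequences G, C: N(G) = (1, 0), N(C) = (0, 1)
print(f(1, 0) + f(0, 1))             # 1.0
# ancestral G, sampled sequences T, C: N(T) = (1, 0), N(C) = (0, 1), N(G) = (0, 0)
print(f(1, 0) + f(0, 1) + f(0, 0))   # 1.0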

Any statistic that doesn't require knowing the ancestral state should be invariant under changing which allele is 'ancestral', so I think (but haven't checked) that translating previous $f()$'s for particular statistics to this scheme only requires dividing them all by two (as in the divergence example).

So, my proposal is to replace the definition of the statistic so that for branch lengths both $f(N)$ and $f(|A|-N)$ contribute. This then requires us to work out the $N$s for each allele at each polymorphic site, maybe by

  1. start with the $N$s computed for each branch in the current algorithm
  2. postprocess these to obtain the $N$ for alleles at that site (need to work out how to do this)

How's that sound?

@jeromekelleher (Member)

This sounds good @petrelharp. I don't really know how to deal with multiple alleles properly. In general, just knowing what the number of alleles actually is seems tricky (imagine lots of flip-flopping mutations down along the tree), and it seems that in the worst case it's hard to do anything without a full tree traversal at each site.

It really gets pretty nasty, and I have no immediate application for the ACTG alphabet, so I wonder whether it's better to comprehensively deal with the binary case (with arbitrary recurrent/back mutations), and worry about k-alleles later? But, I also worry that we'll have to undo a bunch of stuff and change APIs down the road if we don't tackle k alleles now, at least identifying what the problems are and knowing what the outline solution to them will be.

@petrelharp (Contributor, Author)

I don't think it's that hard, given the information we have at hand?

Suppose that for a given site we have
(a) a list of mutations $(i_k,x_k)$: node and derived state, ordered by increasing time-ago of the node,
(b) the sparse tree for this site
(c) for each node $i$, the set of samples $D(i)$ that descend from the node
I'll assume there is at most one mutation per node; for things we're considering here only the most recent one matters.

Here is an algorithm to obtain genotypes at this site. For each $k$ beginning from $k=1$:

  1. Assign everyone in $D(i_k)$ the allele $x_k$.
  2. Move back up through the tree from $i_k$, checking to see if we pass through any other $i_j$ along the way (easy, as they are ordered by time); for any $j$ such that $i_k$ inherits from $i_j$, update $D(i_j) \mapsto D(i_j) \setminus D(i_k)$.
    At the end, assign anyone without an allele to the ancestral state.

As for computing statistics: the key is producing, for a given subset $A$ of samples and for each allele $x$ at the site, how many individuals in $A$ have allele $x$. Now instead of (c) above suppose we know
(c') for each node $i$, the number of individuals in $A$ inherited from that node, denoted $N(i)$.
To produce the required numbers, say, $N(x)$, the algorithm above can be used, but updating $N(i_j) \mapsto N(i_j) - N(i_k)$ instead. We'll also need to check if the allele $x_k = x_j$ for some $j < k$, in which case we update $N(i_j) \mapsto N(i_j) + N(i_k)$.

In retrospect: I think that your schema will allow us these fairly efficient algorithms, so I'm good with it. I don't see how to improve it (especially as multiple mutations are rare).

@jeromekelleher (Member)

Thanks for looking at this @petrelharp, it's a great help to get your input.

It's interesting that your algorithm for generating genotypes goes from leaf-to-root; I would have thought the other way around was the way to do it. Consider the following tree:

[figure: example tree with numbered internal nodes, including nodes 13, 14, 16, 19, and 20, referenced below]

Suppose we have one site with ancestral state A, and three mutations: [(19, T), (13, C), (14, G)]. We want to compute the frequency of all alleles.

  • First, set f[T] = 6, because there are 6 leaves below 19.
  • Then, set f[C] = 2 (two leaves below 13) and f[T] = 6 - 2 = 4, because 19 is an ancestor of 13.
  • Then, set f[G] = 2.
  • Finally, set f[A] = n - sum(f).

I've ordered the nodes at the site in pre-order here, as I think it makes it easier to reason about what's happening in the current subtree (i.e., here we fully finish up with the subtree below 19 before moving to look at the mutation at 14). It seems to me that there is a bit of efficiency to be gained there because we don't have to look at the entire list of nodes but only the last when we're moving back up the tree to see if we're affecting any existing mutations.

There is a slight trickiness when computing the frequency of the alleles when we have back mutations to the ancestral state, but it's a small detail. I think this will work for calculating frequencies within subsets as well. The cost should be O((k - 1) * height of tree) if we have k mutations at a site, since in the worst case we traverse up to the root k - 1 times to find ancestral mutations. This is probably about as efficient as it's going to get.

What do you think, does the algorithm make sense to you? I haven't written it out explicitly here, but hopefully it's fairly clear.

> In retrospect: I think that your schema will allow us these fairly efficient algorithms, so I'm good with it. I don't see how to improve it (especially as multiple mutations are rare).

I'm very happy to hear that!

@petrelharp (Contributor, Author)

Hm, interesting. I think the main difference isn't really the time direction, it's what order the mutations are stored in.

A possible advantage of the leaf-to-root direction is that once you find a genotype, it doesn't change. If trees might have many mutations this could be good, as we wouldn't have to actually reach the root of the tree (only up to the most recent mutation for each sample of interest). But: not really relevant for population data.

Your point about the number of comparisons between mutations is probably better, though. However, I think you'd have to go a bit further than just the last node. For instance, suppose that the mutations on that tree were [(20, T), (16,C), (13, C), (14, G)]. We'd need to keep track of the current list of mutations on the way back to the root, and compare all the way up that. Call this list S.

  1. First, set f[T] = 8, because there are 8 leaves below 20; set S=[20].
  2. Then, since 20 is an ancestor of 16 set f[C] = 2 (two leaves below 16) and f[T] = 8 - 2 = 6; and S=[20,16].
  3. Next, check that 16 is not an ancestor of 13, but 20 is, so set f[C] = 2 + 2 = 4 (two more leaves below 13) and f[T] = 6 - 2 = 4; and S=[20,13].
  4. Then, since 14 is not descended from 20 or 13, set f[G] = 2.
  5. Finally, set f[A] = n - sum(f).

Recording the mutations in pre-order would take a bit of work, and is almost recording the whole tree of relationships between mutations - maybe this is something to think about? Keeping things in time order is simpler, but I guess if we want to compute a large number of statistics, this precomputation might be worthwhile. It would be harder to record and harder to check, though.

If we kept the whole tree of relationships between mutations we could combine both algorithms.

What do you think? Let me know; I'll code up a version.

@jeromekelleher (Member)

> Your point about the number of comparisons between mutations is probably better, though. However, I think you'd have to go a bit further than just the last node. For instance, suppose that the mutations on that tree were [(20, T), (16, C), (13, C), (14, G)]. We'd need to keep track of the current list of mutations on the way back to the root, and compare all the way up that. Call this list S.

Yes, you're right. We need to keep a stack of the mutations on the nodes that we have traversed on the way down from the root. The algorithm you've outlined is the right approach, I think.

> Recording the mutations in pre-order would take a bit of work, and is almost recording the whole tree of relationships between mutations - maybe this is something to think about? Keeping things in time order is simpler, but I guess if we want to compute a large number of statistics, this precomputation might be worthwhile. It would be harder to record and harder to check, though.

Well, it depends on how you come up with the mutations in the first place. If you are generating them on the tree as a Markov process, then pre-order is the order in which they are naturally generated and putting them in time order is an extra sorting step on top of this. I'd imagine any method for placing mutations on the tree from real data would have a similar form, so I don't think storing them in preorder costs us anything.

Checking is another story though, and I agree this would be expensive. Not terribly expensive I think, but certainly worse than checking for time sorting.

What turned me off time sorting the mutations though is that it's actually much stricter than we need and can lead to unexpected problems. For example, when placing the mutations on trees inferred from data, it turned out that I had to put an extra sorting step at the end even though the mutations were in preorder. Similar ugliness happens during simplify, which I also haven't dealt with properly in the general mutation case.

> If we kept the whole tree of relationships between mutations we could combine both algorithms.

Interesting! Can you explain a bit more what you are thinking here?

@jeromekelleher (Member)

To make this a bit more concrete, I knocked together a quick Jukes-Cantor mutation generator in Python:

import msprime
import numpy as np

def generate_site_mutations(tree, position, mu, site_table, mutation_table):
    """
    Generates mutations for the site at the specified position on the specified
    tree. Mutations happen at rate mu along each branch. The site and mutation
    information are recorded in the specified tables.
    """
    assert tree.interval[0] <= position < tree.interval[1]
    states = {"A", "C", "G", "T"}
    state = np.random.choice(list(states))
    site_table.add_row(position, state)
    site = site_table.num_rows - 1
    stack = [(tree.root, state)]
    while len(stack) != 0:
        u, state = stack.pop()
        if u != tree.root:
            branch_length = tree.branch_length(u)
            num_mutations = np.random.poisson(mu * branch_length)
            new_state = state
            for _ in range(num_mutations):
                new_state = np.random.choice(list(states - set(new_state)))
            if state != new_state:
                mutation_table.add_row(site, u, new_state)
            state = new_state
        stack.extend(reversed([(v, state) for v in tree.children(u)]))

def jukes_cantor(ts, num_sites, mu):
    """
    Returns a copy of the specified tree sequence with Jukes-Cantor mutations
    applied at the specified rate at the specified number of sites. Site positions
    are chosen uniformly.
    """
    positions = np.random.uniform(0, ts.sequence_length, num_sites)
    positions.sort()
    sites = msprime.SiteTable(num_sites)
    mutations = msprime.MutationTable(num_sites)
    trees = ts.trees()
    t = next(trees)
    for position in positions:
        while position >= t.interval[1]:
            t = next(trees)
        generate_site_mutations(t, position, mu, sites, mutations)
    tables = ts.dump_tables()
    new_ts = msprime.load_tables(
        nodes=tables.nodes, edgesets=tables.edgesets, sites=sites, mutations=mutations)
    return new_ts

def main():
    np.random.seed(2)
    mu = 1
    ts = msprime.simulate(10, recombination_rate=0, random_seed=1)
    new_ts = jukes_cantor(ts, 3, mu)
    print("Sites:")
    for site in new_ts.sites():
        print("site :", site.position, site.ancestral_state)
        for mutation in site.mutations:
            print("\t", mutation.node, mutation.derived_state)

    print("Haplotypes:")
    for h in new_ts.haplotypes():
        print(h)

if __name__ == "__main__":
    main()

Running this, I get

Sites:
site : 0.025926231827891333 G
         12 C
         15 C
         2 G
site : 0.43599490214200376 T
         12 A
         4 A
         16 C
         11 T
         6 A
site : 0.5496624778787091 C
         12 A
         4 G
         16 G
         6 T
Haplotypes:
GTG
CAA
GCG
GTG
GAG
CCG
CAT
CAA
GTG
CCG

These mutations are in preorder and stored as they are produced. There is no post-processing needed.

I guess this is a very naive algorithm (involving a full tree traversal for every site), but I'm not sure you're going to do much better if there are lots of mutations.

What do you think @petrelharp?

@petrelharp (Contributor, Author)

petrelharp commented Mar 31, 2017 via email

@jeromekelleher (Member)

> In the mutation table we could store an additional variable, 'parent mutation', which gives the index of the mutation that this mutation occurred on the background of. Then we wouldn't need to do any additional traversing of the full tree to figure out how mutations are nested. If the mutations are being generated with this information handy, it seems worthwhile to store - it'd reduce complexity of the code that uses the mutations.

Aha! Very cunning. Yes, this is the way to do it. Let me have a think about it.
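A sketch of how that could look, assuming we thread the index of the most recent mutation on the root-to-node path through the traversal stack of generate_site_mutations above; the extra parent column on mutation_table.add_row is hypothetical:

import numpy as np

def generate_site_mutations_with_parents(tree, position, mu, site_table, mutation_table):
    """
    Variant of generate_site_mutations above that also records, for each
    mutation, the index of the mutation it occurred on the background of
    (-1 for the ancestral state). The extra add_row column is hypothetical.
    """
    states = {"A", "C", "G", "T"}
    state = np.random.choice(list(states))
    site_table.add_row(position, state)
    site = site_table.num_rows - 1
    stack = [(tree.root, state, -1)]  # (node, state, index of parent mutation)
    while len(stack) != 0:
        u, state, parent = stack.pop()
        if u != tree.root:
            num_mutations = np.random.poisson(mu * tree.branch_length(u))
            new_state = state
            for _ in range(num_mutations):
                new_state = np.random.choice(list(states - set(new_state)))
            if state != new_state:
                # hypothetical fourth column: the parent mutation's index
                mutation_table.add_row(site, u, new_state, parent)
                parent = mutation_table.num_rows - 1  # descendants point here
                state = new_state
        stack.extend(reversed([(v, state, parent) for v in tree.children(u)]))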

@jeromekelleher (Member)

Here is a simple implementation of the allele frequency algorithm when we have mutations sorted in preorder:

import collections

def allele_frequency(tree, site):
    """
    Returns the allele frequencies at the specified site.
    """
    f = collections.Counter()
    f[site.ancestral_state] = tree.sample_size
    visited_mutations = {tree.root: site.ancestral_state}
    for mutation in site.mutations:
        num_leaves = tree.num_leaves(mutation.node)
        f[mutation.derived_state] += num_leaves
        u = mutation.node
        while u != -1 and u not in visited_mutations:
            u = tree.parent(u)
        if u != -1:
            f[visited_mutations[u]] -= num_leaves
        visited_mutations[mutation.node] = mutation.derived_state
    return {k: v for k, v in f.items() if v > 0}

def allele_frequencies(ts):
    """
    Returns a dictionary of allele frequencies for each site.
    """
    freqs = []
    for tree in ts.trees():
        for site in tree.sites():
            freqs.append(allele_frequency(tree, site))
    return freqs

This requires a traversal back to root for each mutation (in the worst case), which is O(log n), and an O(log #mutations) lookup to see if we have visited a particular mutation yet. This isn't great, but isn't terrible either. The question is, how much cheaper is this if we store the mutation tree? Is it worth the cost of extra storage and complexity?

@petrelharp (Contributor, Author)

petrelharp commented Apr 5, 2017 via email

@jeromekelleher (Member)

> well, storing the tree of mutations would reduce those log(n)+log(#muts) steps to log(#muts). it's actually log_2(n), no? so would be a factor of order 10?

Yeah, that's pretty much my take. Although, if we could get rid of the dictionary lookup entirely, then that would be ideal.

> not sure how to balance this against storage, complexity. increase in storage is one additional thing per derived mut, so, substantial. that's like doubling memory footprint of mutations, no?

We're currently storing (site, node, derived_state) for each mutation, so having (site, node, parent, derived_state) wouldn't be a significant extra cost.

For me, it's all about the algorithm. If the algorithm is sufficiently clean and simple for getting allele frequencies, then it's worth the extra complexity of tracking the parent mutation for every mutation. If the algorithm looks roughly the same except we avoid the traversal to root, then it's not quite so clear to me that it's worth the effort. This is, after all, a relatively rare case of multiple mutations and we don't want to prematurely optimise.

> your view? i could implement this assuming storage of mutational parent.

Would you like to have a go at the allele frequency algorithm, given the mutational parent information? It should be easy enough to augment the code above to just pass around an extra array which is used in the allele frequency calculation.

@petrelharp (Contributor, Author)

petrelharp commented Apr 18, 2017

Adding this to the Jukes-Cantor example code above will do it (swap in for the allele_frequency function above):

def get_parent_mutation(node, tree, site):
    """
    Returns the index of the mutation strictly above node (or -1 if there is none).
    """
    u = tree.parent(node)
    mutation_nodes = [x.node for x in site.mutations]
    while u != -1 and u not in mutation_nodes:
        u = tree.parent(u)
    if u == -1:
        return -1
    else:
        return mutation_nodes.index(u)

def allele_frequency(tree, site):
    """
    Returns the allele frequencies at the specified site.
    """
    f = collections.Counter()
    f[site.ancestral_state] = tree.sample_size
    for mutation in site.mutations:
        num_leaves = tree.num_leaves(mutation.node)
        f[mutation.derived_state] += num_leaves
        parent_mutation = get_parent_mutation(mutation.node, tree, site)
        if parent_mutation == -1:
            parent_state = site.ancestral_state
        else:
            parent_state = site.mutations[parent_mutation].derived_state
        f[parent_state] -= num_leaves
    return {k: v for k, v in f.items() if v > 0}

# and adding to main():

    print("Frequency spectrum:")
    afs = allele_frequencies(new_ts)
    for site in afs:
        print("site:")
        for x in site:
            print("\t", x, site[x])

Here get_parent_mutation is standing in for actually having recorded this and stored it in the mutation table.

Note that we don't have to traverse the tree in any particular order, or keep track of which nodes we've visited already - just add leaves below the node and subtract those from the parent.

@jeromekelleher (Member)

Yes, this algorithm is nice isn't it? I'll have another look at this soon when I get a chance and see what's involved in making the updates to store the parent mutation.

@petrelharp (Contributor, Author)

petrelharp commented Oct 21, 2017

Whoops - my leaf-to-root algorithm above is not quite right: as written, it only subtracts the descendants of direct descendant mutations, not of all descendant mutations. (That's the case if you stop at the previous mutation; if instead you moved all the way up to the root it would work.) Moving root-down fixes this, because we haven't removed the offspring of descendant nodes yet.

There's an easy way to do this - let X[j] be the number of nodes below j and let p[j] be the parent of j; then first initialize U[j] = X[j] and then for each j, subtract X[j] from U[p[j]].
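A minimal sketch of that one-pass fix, assuming X[j] and the parent array p[j] (with -1 for mutations whose parent is the ancestral state) are already in hand:

def counts_per_mutation(X, p):
    # U[j]: number of samples whose most recent mutation at this site is j.
    U = list(X)
    for j in range(len(X)):
        if p[j] != -1:
            U[p[j]] -= X[j]
    return U

# e.g. mutation 0 over 6 samples, with a nested mutation 1 over 2 of them:
counts_per_mutation([6, 2], [-1, 0])  # -> [4, 2]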

@petrelharp (Contributor, Author)

petrelharp commented Oct 22, 2017

A note about the above -- at first we thought about generalizing to weight functions that took tuples, so if a particular site had three alleles with counts n[1], n[2] and n[3], then we'd add f(n[1], n[2], n[3]). But, we've settled on the less general form of f(n[1]) + f(n[2]) + f(n[3]). This is only less general for more-than-biallelic sites, because we know beforehand what sum(n) is, so that the function can be f(n[k], sum(n)).

This has the advantage of making the arguments to f of known length, which will make it easier to move them into C.
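As a toy illustration of the two forms (the function names here are hypothetical):

# General (rejected) form: one call on the whole tuple of allele counts,
# so f_tuple's arity depends on the number of alleles at the site.
def site_weight_tuple(allele_counts, f_tuple):
    return f_tuple(*allele_counts)

# Settled (additive) form: sum f over alleles; f's argument list has known
# length, since sum(allele_counts) is known beforehand.
def site_weight_additive(allele_counts, f):
    total = sum(allele_counts)
    return sum(f(n_k, total) for n_k in allele_counts)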

@petrelharp (Contributor, Author)

OK, I've moved the TreeStat machinery back to working with a list of integers rather than a list of tuples (easy, because this is basically how it worked already).

@petrelharp (Contributor, Author)

Now is also the time to finalize naming, btw. Currently, we have

  • TreeStatCalculator for the statistics based on branch lengths and
  • SiteStatCalculator for the mutation-based, single-site statistics.

Since both are on trees, a better name for the first might be BranchLengthStatCalculator. Or not? What do you think, @jeromekelleher @ashander ?

@petrelharp (Contributor, Author)

petrelharp commented Oct 22, 2017

There's a normalization problem: a TreeStat and a SiteStat with the same weighting function are almost the same thing. But with TreeStat, we sum f(x) over all branches. This would be the equivalent of summing over all derived alleles. But with SiteStat we include the ancestral allele also. Since all our stats (thus far) don't care which allele is ancestral, this means that the TreeStat is one-half of what the SiteStat would be for the same weighting function (in the limit of dense but still biallelic mutations).

This is analogous to mean pairwise diversity being equal to twice mutation rate times mean TMRCA. But those are established things we can't change; here we are in the position to define the "f4" statistic for branch lengths, and we should define it to be "the thing with the same weighting function".

What to do about this? Options:

  1. multiply everything by two for TreeStats
  2. divide everything by two for SiteStats
  3. deal with the discrepancy

For consistency with others, either (1) or (2) is right, but it doesn't really matter - when we implement e.g. f4() we will do it so that we get the correct f4 statistic for SiteStats; for one of these options we'll have to adjust our currently-defined weighting function by a factor of 2 (not sure which one).

So, we should go with the conceptually more natural thing. I guess I vote for (1), because in computing TreeStats we are taking advantage of symmetry and only evaluating the function on one part of the bipartition.

So, I'm going with (1) but can readjust if there's other opinions.

Just to be clear, we are keeping the same interpretation of e.g. "branch length f4" in terms of branch lengths, as is clear from comparison to the use of this function in tests. This just means we need to divide the weighting function we use to get it by two (or actually, divide all output by two in the end).

@petrelharp (Contributor, Author)

petrelharp commented Oct 22, 2017

The factors of two have been inserted, and all cancel with each other. =)

(this isn't terribly inefficient because it only happens on the output, not actually every time we call the weighting function, e.g. here canceling with here)

@petrelharp (Contributor, Author)

Golly gee, it seems to be working! Needs a bit more testing, but getting there...

@petrelharp (Contributor, Author)

FYI: I've unified interfaces here. Now there are two classes, SiteStatCalculator and TreeStatCalculator, that both pretty much only provide their own tree_stat_vector() method, which does the weighted function summing thing. All the specialized statistics like divergence, f4, Y, etc. are defined in the parent class to both, GeneralStatCalculator. This is possible because they use the same weighting functions.

@jeromekelleher (Member)

Sounds great, thanks @petrelharp! Re naming things, BranchLengthStatCalculator and SiteStatCalculator seem good to me. BranchStatCalculator or TopologyStatCalculator might also be worth considering, but maybe the precision is good here as we might be calculating other statistics on topologies later (i.e., tree distance metrics). I'd imagine we'll try and hide these classes behind some user-friendly function interface anyway, so it's OK if the class names are a bit clunky.

For the normalisation thing, I'm happy to follow your lead with either (1) or (2).

@petrelharp (Contributor, Author)

I have changed the name, from TreeStatCalculator to BranchLengthStatCalculator. cc @ashander

@petrelharp (Contributor, Author)

There's a conceptual issue here if there can be more than two alleles.

Ran into this in #290: the definition I like of Y3(A;B,C) is that it is the density of sites at which we have a split a|bc, where a, b, and c are random draws from A, B, and C respectively; or the expected value if working from branch lengths. Under a continuum-sites model on a segment of unit length, with A=[a], B=[b], and C=[c], this is

  • total length of branches separating a|bc
  • number of sites with alleles a != b and b == c
  • number of sites with alleles a != b and a != c.

I noticed this because the latter two things are not equivalent if there can be more than two alleles. (Good thing we implemented Jukes-Cantor for testing!) The way I proposed implementing things above was to have statistics additive across alleles, so that the weight associated with a segregating site is

sum_i f(a_i, n_a - a_i; b_i, n_b - b_i; c_i, n_c - c_i)

where the sum is over alleles i, a_i is the number of samples in A having allele i, and n_a is the total number of samples in A - the weight on an allele can only depend on the number of samples that have the allele and the number that don't. So the way I implemented Y3, in the case where n_a = n_b = n_c = 1, works out to assign weight 1/2 to each allele that either a has but b and c do not, or that b and c have but a does not.

But then if a, b, and c have all distinct alleles, this gives weight 1/2 to the site even though there's a mutation occurring on the a|bc branch and really we want it to have weight 1. We could fix this by changing the weighting function to assign weight 1 to an allele that a has while b and c do not (and is zero in other cases).

But now consider the (now, four-taxon) statistic that measures branches with ab|cd. In principle sites at which all four samples differ should contribute to this, but to do so starts to become a complicated problem.

But on the other hand, the correspondence between branch lengths and numbers of mutations only holds to the extent that sites have a single mutation on them; so it is not sensible to worry about such sites too much.

Mulling this over a bit more.

@petrelharp (Contributor, Author)

At issue is the desire to use the same weighting function for both types of statistic. If we change the weight to ask if the allele is inherited by a but not b or c, this is:

def w(x):
    # here n is the ambient vector of sample-set sizes, indexed 1..3 as in the text
    return x[1] * (n[2] - x[2]) * (n[3] - x[3]) / (n[1] * n[2] * n[3])

To "depolarize" this we currently have

def w_nonpolarized(x):
    return (x[1] * (n[2] - x[2]) * (n[3] - x[3]) + (n[1] - x[1]) * x[2] * x[3])/ (2 * n[1] * n[2] * n[3])

This latter has the property that it is invariant under exchanging x and n-x; this is a property that guarantees it does not depend on polarization (ancestral state), and means that in the tree ((b,c),a), when applied to branches, it will correctly count both the edge going from the root to a as well as the edge from the root to mrca(b,c). If we did not have this property then we'd need to do something like add both w(x, n-x) and w(n-x, x) to the statistic in the BranchLength version; this is the analogous thing to including the ancestral allele in the Site version's computation.

@jeromekelleher (Member)

This all sounds sensible @petrelharp, but I'm afraid it's gone over my head to a large degree. I think we need to have a call sometime so you can explain this stuff to me!

@petrelharp (Contributor, Author)

OK, I know what to do. tl;dr is: remove the factor of 2, sum over both sides of a branch for BranchLength stats, and only consider stats that are stated as properties of bipartitions. We should definitely talk sometime; it'd help figure out how to write this down concisely.

Let's consider only statistics having the following properties:

  1. they do not depend on the ancestral state, and
  2. they are linear in the branches/alleles.

The second thing means that the statistic is formed by summing over the bipartitions induced by each allele, and is motivated by the fact that in an infinite sites model you can't have two mutations at the same site, so if every tree were independent you couldn't estimate higher moments of the branch lengths. Also, it simplifies things substantially, and I would be suspicious of any popgen inference depending strongly on seeing more than two alleles per site.

Consider Y3(A;B,C). Two possible definitions as above are:

  1. density of sites with alleles a != b and b == c
  2. density of sites with alleles a != b and a != c.

The first one is not admissible, since it requires knowing the allelic states of all three samples -- consider the case where a=G, b=C, and c=T (that's a nucleotide C, not a set C). You cannot find that probability by summing a single function across all bipartitions. (Although: I haven't yet found a concise statement of what I mean there, and a proof.)

This means using the weighting function

def w(x):
    return x[1] * (n[2] - x[2]) * (n[3] - x[3]) / (n[1] * n[2] * n[3])

and then, to depolarize it for BranchLength stats, adding both w(x) and w(n-x) to the statistic. This is twice as much computation but means I get to take out that factor of two. For Site stats, this is automatically depolarized because we sum over all alleles (including the ancestral one).
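A small runnable sketch of that depolarization (here w takes n explicitly rather than the ambient n of the snippets above; lists are indexed 1..3 as in the text, with index 0 unused):

def w(x, n):
    return x[1] * (n[2] - x[2]) * (n[3] - x[3]) / (n[1] * n[2] * n[3])

def branch_contribution(x, n):
    # evaluate the weight on both sides of the branch's bipartition
    n_minus_x = [0] + [n[i] - x[i] for i in (1, 2, 3)]
    return w(x, n) + w(n_minus_x, n)

# In the tree ((b,c),a) with n = [0, 1, 1, 1]: the edge to a has x = [0, 1, 0, 0]
# (weight 1 + 0), and the edge to mrca(b,c) has x = [0, 0, 1, 1] (weight 0 + 1),
# so both branches separating a from b and c are counted, as described above.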

Another note about what's going on: if w is a degree-k polynomial, that means it only depends on the distribution of trees with at most k tips, because it can be written as a weighted sum of probabilities of k-allele patterns. We should call this the degree of the statistic, and furthermore say that it is homogeneous if it can be written as a weighted sum of products of exactly k terms where each term is either x[i] or n[i] - x[i].

Furthermore, we are saying that a Site stat and a BranchLength stat are "the same" (better term?) if

  1. they have the same weighting function, and
  2. the BranchLength one is the expected value of the Site one, averaging over laying down infinite-sites mutations;

and we are defining things so that (1) and (2) are equivalent conditions.

@jeromekelleher (Member)

Sounds good to me @petrelharp, but I've really only got a superficial grasp of what's going on here. I'm looking forward to rolling my sleeves up and properly getting to grips with this once I've cleared off a few things from the TODO mountain!

petrelharp transferred this issue from tskit-dev/msprime on Jan 9, 2019
@jeromekelleher (Member)

This is done now with general branch stats code.
