"mutation" mode for statistics #2982

Open
petrelharp opened this issue Sep 6, 2024 · 1 comment

Comments

@petrelharp
Contributor

petrelharp commented Sep 6, 2024

In #2948 @tforest is working on adding time windows to statistics. How this should work for mode="branch" is clear; however, it is not clear for mode="site". (It doesn't make sense to have time windows for mode="node", btw.) The reason is that site mode sums over all alleles (thus mimicking what happens with real data); and if there have been multiple mutations at a site, more than one mutation, at different times, could have led to the same allele.

So, what we'd really like to do, and what people probably mostly imagine is happening with mode="site", is to sum over all mutations, rather than alleles. This would be essentially equivalent to mode="branch", but measuring branch area by counting how many mutations are on it instead of doing span times length. So - we're proposing adding a new mode, called "mutation" to statistics, that does this.

In a bit more detail. Recall that to compute a statistic we have some weights and a summary function, $f()$. For a node $n$ let $w_T(n)$ be the total weight of all samples below the node in tree $T$. Then the branch-mode statistic is
$$\sum_T |T| \sum_n \ell_T(n) f(w_T(n)) ,$$
where $|T|$ is the span of the tree and $\ell_T(n)$ is the length of the branch above $n$ in tree $T$. And, if we define $w_s(a)$ to be the total weight of all samples carrying allele $a$ at site $s$, then the site-mode stat is
$$\sum_s \sum_a f(w_s(a)) .$$
We can rewrite this as a sum over trees, pedantically, as
$$\sum_T \sum_{s \in T} \sum_{a \in s} f(\sum_{d_s(m) = a} w_T(n_m)),$$
where the sums are over, respectively, trees; sites in that tree; and alleles at that site; while "$\sum_{d_s(m) = a}$" means "the sum over all mutations at site $s$ whose derived allele is $a$"; and $n_m$ is the node for mutation $m$. This looks complicated but it's just what you'd imagine. So, we're proposing that mutation-mode just doesn't group together the distinct mutations producing the same allele:
$$\sum_T \sum_{s \in T} \sum_{a \in s} \sum_{d_s(m) = a} f(w_T(n_m)) .$$
If polarised=False, then the sum would also include "the root", i.e., include a term for $f(\bar w - \sum_m w_T(n_m))$, as if there were a mutation to the ancestral state at the root as well. (And, maybe the stat should be polarised by default, unlike other stats?)
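To make the difference concrete, here is a minimal pure-Python sketch of the two sums above on a toy tree with a recurrent mutation. Everything in it is hypothetical and for illustration only: the tree, the summary function `f`, and the helper names `weights_below`, `site_mode` and `mutation_mode` are assumptions, not tskit API, and the site-mode sum follows the formula above (so it assumes the mutations at the site are not nested).

```python
from collections import defaultdict

# Hypothetical toy tree as parent pointers: samples 0..3, root 6.
PARENT = {0: 4, 1: 4, 2: 5, 3: 5, 4: 6, 5: 6, 6: None}
SAMPLE_WEIGHT = {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0}
TOTAL_WEIGHT = sum(SAMPLE_WEIGHT.values())  # \bar w in the text


def weights_below(parent, sample_weight):
    """w_T(n): total weight of the samples at or below each node."""
    w = defaultdict(float)
    for u, x in sample_weight.items():
        while u is not None:
            w[u] += x
            u = parent[u]
    return w


def f(x):
    """Example summary function (an arbitrary choice for this sketch)."""
    return x * (TOTAL_WEIGHT - x)


def site_mode(mutations, w):
    """Polarised site mode: group mutations by derived allele, then apply f."""
    by_allele = defaultdict(float)
    for node, derived in mutations:
        by_allele[derived] += w[node]
    return sum(f(x) for x in by_allele.values())


def mutation_mode(mutations, w, polarised=True):
    """Proposed mutation mode: apply f to each mutation separately."""
    total = sum(f(w[node]) for node, _ in mutations)
    if not polarised:
        # Extra "root" term, as if a mutation to the ancestral state sat there.
        total += f(TOTAL_WEIGHT - sum(w[node] for node, _ in mutations))
    return total


# One site, two independent mutations to the same derived allele "T".
mutations = [(0, "T"), (5, "T")]
w = weights_below(PARENT, SAMPLE_WEIGHT)

site_stat = site_mode(mutations, w)                           # f(1 + 2)
mutation_stat = mutation_mode(mutations, w)                   # f(1) + f(2)
unpolarised_stat = mutation_mode(mutations, w, polarised=False)
print(site_stat, mutation_stat, unpolarised_stat)
```

With this choice of `f`, site mode sees a single allele carried by three samples, while mutation mode counts the two mutations separately, so the two values differ exactly because of the recurrent mutation.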

There have been some other requests for (essentially) this, mostly around divergence. By trying to compute exactly-what-you-get-from-sequences (in mode="site") we have in a sense removed the advantage of the tree sequence that lets us distinguish one from multiple mutations (in principle).

Finally: the branch stat is the expected value of the site stat, given the trees, under infinite-sites neutral mutations. This mode would remove the infinite-sites caveat: the branch stat is the expected value of the mutation stat, as long as mutations are neutral (and Poisson).
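A rough simulation check of this expectation, as a sketch under stated assumptions: the toy tree, the summary function `f`, the mutation rate `MU` and the helper names are all hypothetical, and mutations are dropped independently on each branch as a Poisson number with mean `MU * SPAN * length`. Under those assumptions the average mutation-mode stat should approach `MU` times the branch-mode stat.

```python
import random

# Hypothetical toy tree over SPAN units of genome. For each non-root node:
# (length of the branch above it, total sample weight below it).
SPAN = 1.0
BRANCHES = {
    0: (1.0, 1.0), 1: (1.0, 1.0), 2: (2.0, 1.0), 3: (2.0, 1.0),
    4: (2.0, 2.0), 5: (1.0, 2.0),
}
TOTAL_WEIGHT = 4.0
MU = 0.5  # assumed mutation rate per unit branch length per unit span


def f(x):
    """Example summary function (an arbitrary choice for this sketch)."""
    return x * (TOTAL_WEIGHT - x)


def branch_stat():
    """Branch mode for one tree: span * sum_n length(n) * f(w(n))."""
    return SPAN * sum(length * f(w) for length, w in BRANCHES.values())


def poisson(lam, rng):
    """Poisson(lam) draw: count unit-rate exponential arrivals in [0, lam]."""
    k, t = 0, rng.expovariate(1.0)
    while t <= lam:
        k += 1
        t += rng.expovariate(1.0)
    return k


def simulated_mutation_stat(rng):
    """Polarised mutation mode: each mutation on a branch contributes f(w)."""
    return sum(
        poisson(MU * SPAN * length, rng) * f(w)
        for length, w in BRANCHES.values()
    )


rng = random.Random(1)
reps = 20000
mean_mutation_stat = sum(simulated_mutation_stat(rng) for _ in range(reps)) / reps
expected = MU * branch_stat()
print(mean_mutation_stat, expected)
```

Nothing here depends on there being at most one mutation per site, which is the point: recurrent mutations on the same or different branches are each counted once.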

@hyanwong
Member

hyanwong commented Sep 6, 2024

This sounds like a very good idea to me. I have several times wondered if such a thing should exist.
