Aitchison and Robust Aitchison distance? #433

Closed
antagomir opened this issue Aug 14, 2021 · 19 comments

@antagomir
Contributor

antagomir commented Aug 14, 2021

Aitchison distance and its robust version have become frequent choices in the analysis of microbial communities, where compositional data is ubiquitous. Access to these through an established, dedicated package would be a better solution than creating new implementations, and would allow seamless linking with other packages that rely on vegan::vegdist.

Aitchison distance needs a pseudocount in many applications. An alternative, the "robust Aitchison distance", has been proposed in the literature; the difference is that the CLR transformation is done only on the non-zero values. This has gained some attention recently; see e.g. Martino et al. (2019) and, from there, links to more original references on robust CLR / robust Aitchison.

Would you consider adding Aitchison and robust Aitchison distance as new options in vegan::vegdist?

@jarioksa
Contributor

A quick look at the paper indicates that the index can be calculated as a Euclidean distance after "CLR transformation". In vegan's design this potentially means implementing the CLR transformation in decostand. This route would be used if the CLR transformation is regarded as useful outside distance calculations. Are you considering submitting the code as a pull request?

@antagomir
Contributor Author

Yes, the Aitchison distance equals the Euclidean distance in CLR-transformed data.

Yes, the CLR transformation is also useful as such, outside of distance calculations. Despite its limitations (requirement of a pseudocount etc.), it is frequently used, at least in microbial community analysis, to remove compositionality bias in relative data and enhance statistical comparisons between samples. Other log-ratio transformations (ALR, ILR, phILR) are also sometimes used for this.
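The equivalence stated above can be illustrated with a minimal sketch. It is written in plain Python so that it runs without any packages, and it is not vegan's code; the function names `clr`, `euclidean` and `aitchison` and the `pseudocount` argument are assumptions made for this illustration only.

```python
import math

def clr(sample, pseudocount=0.0):
    """Centered log-ratio: log of each value minus the mean log of the sample."""
    # math.log raises ValueError on zeros, so zeros need pseudocount > 0.
    logs = [math.log(v + pseudocount) for v in sample]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

def euclidean(a, b):
    """Ordinary Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def aitchison(a, b, pseudocount=0.0):
    """Aitchison distance as the Euclidean distance between CLR-transformed samples."""
    return euclidean(clr(a, pseudocount), clr(b, pseudocount))

x = [1.0, 2.0, 4.0]
y = [2.0, 2.0, 2.0]
d = aitchison(x, y)  # sqrt(2) * log(2), about 0.98

# Rescaling a sample (e.g. counts vs. proportions) leaves the distance unchanged.
d_scaled = aitchison([10.0, 20.0, 40.0], y)
```

Because CLR subtracts the mean log within each sample, multiplying a sample by a constant does not change the result, which is why `d` and `d_scaled` agree up to rounding.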

But there are already several R packages that provide CLR and other log-ratio transformations. One of them is compositions. If you would prefer avoiding new dependencies, we could first check if this happens to be included in any of the existing dependencies. But these transformations are simple and should be relatively straightforward to implement directly in vegan, if necessary.

We can consider submitting the code as PR if this suggestion finds support.

@jarioksa
Contributor

A comment about adding a dependency: compositions adds a huge number of chained dependencies (i.e. it depends on packages that depend on packages that depend on packages that ... break). The transformations should be added independently or using a more lightweight dependency. The greatest complications in those transformations seem to be handling log(0) and 0/0.

@gavinsimpson
Contributor

I'm pretty sure people doing this in community ecology just solve the log(x) for x=0 problem with the usual continuity correction and run everything with (equivalent of) log1p(x).

I'm also sure Cajo has written about this whole issue back in the day at least; I was under the impression that the closed compositional nature of the data largely becomes irrelevant once you are talking about more than 10s or 100s of taxa?

I would support this stuff being in vegan. Doing Aitchison's log-ratio (contrast?) PCA is something I have cooked together by hand on a number of occasions when I needed to replicate work done previously or elsewhere. It is also an analysis that Canoco can handle trivially, which is where most ecologists will likely have encountered or performed it.

I don't think we need to depend on compositions; as you say the dependencies would be undesirable.

@antagomir
Contributor Author

antagomir commented Aug 20, 2021

The typical way to deal with zeroes is indeed log1p, but in addition there is a variant called robust CLR (see e.g. here), which performs the transformation only for non-zero values.
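As a rough sketch of the robust CLR idea described above (plain Python, not the implementation from any package; the name `rclr` and the choice to mark zeros as `None` are assumptions for this illustration), the logs are centered using the non-zero values only:

```python
import math

def rclr(sample):
    """Robust CLR sketch: center the logs of the non-zero values only.

    Zeros are not transformed; they are returned as None here (an assumption
    of this sketch) so that the caller decides how to treat them afterwards.
    """
    logs = [math.log(v) for v in sample if v > 0]
    if not logs:
        raise ValueError("sample has no non-zero values")
    mean_log = sum(logs) / len(logs)
    return [math.log(v) - mean_log if v > 0 else None for v in sample]

# The zero in the middle is skipped; no pseudocount is needed.
out = rclr([1.0, 0.0, 4.0])
```

For the example sample the non-zero entries map to -log(2) and log(2), and the zero entry stays untransformed.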

The effect of compositionality is mitigated in higher dimensions but not removed; and at least in microbial communities we frequently use also higher taxonomic levels (e.g. Phylum) and then the number of unique groups can be rather low.

We have already implemented clr and rclr in another package and could move these to vegan instead and call them from there; it might be easy to add alr and ilr in the same go.

Shall we make a PR that adds:

  • clr, rclr to decostand
  • alr and ilr to decostand (note that these return n-1 samples, i.e. sample size drops by one; they use a reference sample)
  • Aitchison distance (with clr) and robust Aitchison distance (with rclr) to vegdist

@nr0cinu

nr0cinu commented Nov 22, 2021

I strongly support this request. CODA methods are becoming more and more common and required in microbial ecology research, so we would greatly benefit from them being natively supported in vegan :)

@antagomir
Contributor Author

We will be happy to help and look forward to admin comments before creating a possible PR.

@antagomir
Contributor Author

Any opportunities for a PR, or shall we look for alternative solutions in the meantime?

The vegan implementation would likely have a wide user base, considering how popular this transformation has recently become in microbial ecology.

@jarioksa
Contributor

jarioksa commented Jan 1, 2022

PR would be very welcome!

@antagomir
Contributor Author

I have prepared the decostand part (clr, rclr). Shall I make the PR to the master branch, or to some other branch?

(I do not have permission to open new branches in the vegandevs/vegan repository, so it has to be one of the available ones.)

@jarioksa
Contributor

jarioksa commented Jan 7, 2022

Make the PR against master. Even if you are going to pursue the task further, this looks like a self-contained PR that can be merged independently. I had a quick look at the code, and it looked OK to me. I'll have a second and deeper look before the merge, but I don't expect any complications. Nice work!

@antagomir
Contributor Author

Herewego!

@jarioksa
Contributor

@antagomir: I would like to change the distance names in vegdist. Now they are aitchison & aitchison_robust. Firstly, I don't like the underscore in names, because it was not allowed in the original S language and older R: in S, _ was a synonym and shortcut for <-. Secondly, in general we can use name completion in function calls, so you only need to write as many letters from the beginning of the name as make it unique. Now vegdist(x, "aitchiso") will be an error, and the minimum strings you need to write are aitchison vs aitchison_. So the names should differ earlier than at the tenth character (whose absence makes it aitchison). Any ideas? rAitchison? Perhaps @gavinsimpson has some suggestions.
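The completion problem can be mimicked with a small sketch (plain Python, only an approximation of R's partial argument matching, assuming an exact match takes precedence over a prefix match). The helper `partial_match` and the alternative name list are hypothetical; `raitchison` is just one possible spelling of the rAitchison suggestion, and the lists ignore every other vegdist method name.

```python
def partial_match(arg, choices):
    """Return the single choice that arg identifies; an exact match wins."""
    if arg in choices:
        return arg
    hits = [c for c in choices if c.startswith(arg)]
    if len(hits) != 1:
        raise ValueError(f"{arg!r} matches {len(hits)} choices: {hits}")
    return hits[0]

old_names = ["aitchison", "aitchison_robust"]
new_names = ["aitchison", "raitchison"]  # hypothetical alternative naming

# With the old names every prefix shorter than the full word is ambiguous.
try:
    partial_match("aitchiso", old_names)
    ambiguous = False
except ValueError:
    ambiguous = True

# With names that differ at the first character, one letter is enough
# (within this two-name toy list).
short_a = partial_match("a", new_names)
short_r = partial_match("r", new_names)
```

In this toy setting the old pair can only be told apart by typing `aitchison` in full or `aitchison_`, while the alternative pair is separated by the first letter.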

@antagomir
Contributor Author

I can add this, but I will first wait to see whether there are more comments on the names. The suggested ones are OK with me.

@antagomir antagomir mentioned this issue Jan 15, 2022
@johannesbjork

johannesbjork commented Mar 28, 2023

Quick question (sorry if I missed it): for the "robust Aitchison", do you simply compute the Euclidean distance on the rclr-transformed abundances after "putting back" the 0s (as you cannot compute distances on a matrix containing NAs)? Or do you do rclr in combination with matrix completion, as in Martino et al. (2019)?

@antagomir
Contributor Author

Thanks for pointing this out @johannesbjork - this implementation is with the simple replacement. I will check if we should clarify the documentation or add the matrix completion imputation step.

@johannesbjork

> Thanks for pointing this out @johannesbjork - this implementation is with the simple replacement. I will check if we should clarify the documentation or add the matrix completion imputation step.

I am not sure it makes sense to add the 0s back to compute the "robust Aitchison".
In rclr we remove the 0s to avoid adding a pseudocount or imputation. And in the calculation of the distances, "0" has meaning. So in the end, the distances, while computed from ratios constructed from non-zero values, are still influenced by the 0s we were trying to get away from. Not sure I'm explaining myself very clearly (sorry).
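One way to make this concern concrete is the following sketch (plain Python, not vegan's code; it assumes, as described in this thread, that zeros are simply set back to 0 after rclr, and the names `rclr_zero_filled` and `clr` are made up for the illustration). A 0 in rclr space is the value taken by an abundance equal to the geometric mean of the non-zero entries, so zero-filling gives the same vector as imputing that geometric mean before an ordinary CLR:

```python
import math

def rclr_zero_filled(sample):
    """rclr on the non-zero values, with zeros put back as 0 afterwards."""
    logs = [math.log(v) for v in sample if v > 0]
    mean_log = sum(logs) / len(logs)
    return [math.log(v) - mean_log if v > 0 else 0.0 for v in sample]

def clr(sample):
    """Ordinary CLR for a sample without zeros."""
    logs = [math.log(v) for v in sample]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

sample = [1.0, 4.0, 0.0]

# Geometric mean of the non-zero values (2.0 for this sample).
nonzero = [v for v in sample if v > 0]
geo_mean = math.exp(sum(math.log(v) for v in nonzero) / len(nonzero))

filled = rclr_zero_filled(sample)
imputed = clr([v if v > 0 else geo_mean for v in sample])
```

Here `filled` and `imputed` agree up to rounding, so a Euclidean distance computed on zero-filled rclr values treats each absent taxon as if it had been observed at the sample's geometric mean.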

@antagomir
Contributor Author

I agree that the imputed version has clear advantages, and I aim to find time to add it as soon as possible. Thanks for drawing our attention to this.

@antagomir
Contributor Author

This issue is closed, except for the matrix completion step, which has now been added as a new issue, #570.

5 participants