This issue was moved to a discussion.

pairsamtools subsampling [new tool, enhancement] #66

Closed
sergpolly opened this issue May 15, 2018 · 5 comments

Comments

@sergpolly
Member

I feel like we would benefit from having a simple pairsamtools subsample tool (or a subsampling option for pairsamtools select) ...

The rationale is to enable some "rigorous" statistics (significance estimation, bootstrapping, permutation testing) for some of the analyses. For example, if we want to measure a "subtle" compartment strength difference between 2 experiments that have 10 million and 12 million pairs, we can subsample both down to 5 million several times, calculate a compartment strength for each subsample, and compare the resulting distributions. Another example would be subsampling and mixing mitotic and G1 pairs to check whether some experimental effects could be explained by such a simple mixture.

Technical notes/questions:

  • is knowing the total # of pairs (# of pairs per chrom, etc.) a priori the only way to subsample in 1 pass (streaming-like)?
  • there might be a need to implement more sophisticated sampling schemes: distance-dependent weights, chrom-dependent weights, cis/trans, etc. (without duplicating what's already available in select) ...
  • is there any other way to do a streaming-like subsample? Do we need to care about its streaming nature?
  • would a pairix index help speed up subsampling? Should we rely on it?
  • does subsample fit into select, or does it deserve to be a separate tool?
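
To make the first question concrete: if the total number of pairs is known up front, an exact-size sample can be drawn in one streaming pass with selection sampling (Knuth's Algorithm S). The sketch below is only an illustration, not anything pairsamtools provides; `selection_sample` and its parameters are hypothetical names, and it assumes header lines have already been separated from the pair records.

```python
import random

def selection_sample(records, total, k, seed=None):
    """Yield exactly k of `total` records in one pass, preserving input order.

    Hypothetical helper: `total` must be the true number of records.
    """
    rng = random.Random(seed)
    remaining = total  # records not yet seen, including the current one
    needed = k         # records still to be selected
    for record in records:
        if needed == 0:
            break
        # Keep the current record with probability needed / remaining.
        if rng.random() * remaining < needed:
            yield record
            needed -= 1
        remaining -= 1
```

Because the output keeps the input order, a sorted input stays sorted. If `total` overstates the real count, fewer than `k` records come out.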
@nvictus
Member

nvictus commented May 15, 2018

the only way to subsample in 1 pass (streaming like) is by knowing the total # of pairs (#pairs per chrom etc) a priori ?!

The pairix index does have the total number of pairs, but not per chrom. CC @SooLee

If you want to sample a fixed number of pairs, rather than a proportion, then reservoir sampling can be used. Also see: https://www.biostars.org/p/110107/
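
For illustration, a minimal sketch of classic reservoir sampling (Algorithm R), which needs no prior knowledge of the stream length. The function name and parameters are hypothetical, and this is not necessarily how any existing tool implements it.

```python
import random

def reservoir_sample(records, k, seed=None):
    """Return k records drawn uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir
```

Note that the reservoir holds k records in memory and does not preserve input order, so a sorted pairs file would need re-sorting afterwards.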

@sergpolly
Member Author

@nvictus but do you agree that it would be a generic-enough and overall useful tool to have ?
or does it sound like something more case-to-case specific ?

@nvictus
Member

nvictus commented May 15, 2018

At its simplest, it seems to be a very generic operation. Unix shuf -n does what you want, but unfortunately loads the input entirely in memory. I'm a bit surprised there is no widely used reservoir sampler utility.

However, as many point out, if you're happy with an approximate result, it's a simple one-liner to downsample a stream of lines. Unless this tool would do more sophisticated things that select + naive downsample can't do, I don't see the need.
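
A rough Python equivalent of that kind of naive downsampling, as a sketch only: each record is kept independently with a fixed probability, so the output size is approximate rather than exact. The name `downsample` is hypothetical, and the sketch assumes header lines start with `#` and should always pass through.

```python
import random

def downsample(lines, fraction, seed=None):
    """Keep each non-header line independently with probability `fraction`."""
    rng = random.Random(seed)
    for line in lines:
        # Header lines (assumed to start with '#') are always kept.
        if line.startswith("#") or rng.random() < fraction:
            yield line
```

It could be wired to a stream with something like `sys.stdout.writelines(downsample(sys.stdin, 0.5))`.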

@nvictus
Member

nvictus commented May 15, 2018

Ah, my bad. It seems shuf now implements reservoir sampling: https://lists.gnu.org/archive/html/coreutils/2013-12/msg00167.html

Also see https://github.com/alexpreynolds/sample

@agalitsyna
Member

Hi @sergpolly, @nvictus, isn't this resolved by pairtools sample, written by @Phlya and @golobor in 2019?
https://github.com/open2c/pairtools/blob/master/pairtools/pairtools_sample.py
I understand it does not support sampling by distance or chromosome, but that can be achieved with a combination of select/sample/merge commands.
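
To illustrate what such a combination amounts to, here is a sketch of group-dependent sampling with different fractions for cis and trans pairs. It is not the pairtools implementation; `sample_cis_trans` is a hypothetical name, and it assumes tab-separated records with chrom1 in column 2 and chrom2 in column 4, plus header lines starting with `#`.

```python
import random

def sample_cis_trans(lines, cis_fraction, trans_fraction, seed=None):
    """Keep cis and trans pairs with separate probabilities; pass headers through."""
    rng = random.Random(seed)
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        chrom1, chrom2 = fields[1], fields[3]  # assumed column layout
        fraction = cis_fraction if chrom1 == chrom2 else trans_fraction
        if rng.random() < fraction:
            yield line
```

The same pattern extends to distance- or chromosome-dependent weights by changing how `fraction` is chosen per record.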

@open2c locked and limited conversation to collaborators on Apr 20, 2022
@agalitsyna converted this issue into discussion #125 on Apr 20, 2022
