Default nan-handling policy is a memory hog #5

Open
ivanov-v-v opened this issue Jul 26, 2019 · 1 comment
Comments

@ivanov-v-v

Because single-cell datasets are very sparse, missing values should be stored in a way that doesn't waste space. Currently, CellSNP labels missing entries with ".:.:.:.:.:." (11 characters, so at least 11 bytes per entry). I would strongly suggest using an empty string instead of that stub. When I manually replaced all occurrences of ".:.:.:.:.:." with an empty string in CellSNP output, the file size dropped from 25.6 GB to 2.5 GB, roughly a tenfold reduction. This placeholder not only wastes space, it also makes the file harder to process with common Python/R tools.
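For reference, the manual replacement described above could be done in a streaming fashion roughly like this. It is only a sketch: the function names `strip_missing` and `rewrite_vcf` are made up for illustration, and it assumes a tab-separated VCF whose sample columns start at the tenth column.

```python
import gzip

MISSING_STUB = ".:.:.:.:.:."  # placeholder quoted in this issue


def strip_missing(line, replacement=""):
    """Replace the missing-value stub in the sample columns of one VCF line."""
    if line.startswith("#"):
        return line  # header lines are passed through unchanged
    fields = line.rstrip("\n").split("\t")
    # Columns 0-8 are the fixed VCF columns (CHROM ... FORMAT); samples follow.
    fields[9:] = [replacement if f == MISSING_STUB else f for f in fields[9:]]
    return "\t".join(fields) + "\n"


def rewrite_vcf(src_path, dst_path, replacement=""):
    """Stream a gzipped VCF from src_path to dst_path, one line at a time."""
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        for line in src:
            dst.write(strip_missing(line, replacement))
```

Passing `replacement="."` instead of the empty string keeps the conventional VCF missing-value marker, which strict parsers are more likely to accept, at a cost of one byte per entry.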

@huangyh09
Collaborator

Very good point. The reason we used ".:.:.:.:.:." is to keep the same format (i.e., the same number of tags) even when an entry is missing. I will check whether common R/Python packages for processing VCF files are compatible with "." for missing values. If so, this would indeed save a lot of space.
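On the reading side, a parser that tolerates both styles is not hard to write. The sketch below is hypothetical (`parse_sample` is not part of CellSNP or any VCF library) and treats a bare ".", an empty field, and the full-length stub the same way, so downstream code does not depend on which placeholder was written.

```python
def parse_sample(format_field, sample_field):
    """Map FORMAT tags to values for one sample; None marks a missing value."""
    tags = format_field.split(":")
    if sample_field in ("", "."):
        # Whole entry is missing: every tag is None.
        return {tag: None for tag in tags}
    values = sample_field.split(":")
    # Pad with "." in case trailing tags were dropped from the entry.
    values += ["."] * (len(tags) - len(values))
    return {tag: (None if v == "." else v) for tag, v in zip(tags, values)}
```

For example, `parse_sample("GT:AD:DP", ".")` and `parse_sample("GT:AD:DP", ".:.:.")` return the same dictionary with every tag set to `None`.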

Alternatively, since v0.1.6 CellSNP supports saving the AD, DP, and OTH tags as sparse matrices.
To get this output, please use -O OUT_DIR instead of -o OUT_FILE.vcf.gz.
You can also use sparseVCF.py to convert an existing VCF.gz file into sparse matrices.
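To illustrate why the sparse form is so much smaller, here is a rough sketch of collecting one tag into coordinate (row, column, value) triplets. This is not how sparseVCF.py is implemented; `sparse_counts` is a hypothetical helper, and it assumes the tag holds a single integer per sample and that missing entries are written as ".", an empty field, or the full stub.

```python
def sparse_counts(lines, tag):
    """Collect non-zero integer values of one FORMAT tag as COO triplets."""
    rows, cols, vals = [], [], []
    n_variants = 0
    n_samples = 0
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta and header lines
        fields = line.rstrip("\n").split("\t")
        idx = fields[8].split(":").index(tag)  # position of the tag in FORMAT
        samples = fields[9:]
        n_samples = max(n_samples, len(samples))
        for j, sample in enumerate(samples):
            parts = sample.split(":")
            if idx >= len(parts) or parts[idx] in ("", "."):
                continue  # missing entry: nothing is stored
            count = int(parts[idx])
            if count != 0:
                rows.append(n_variants)
                cols.append(j)
                vals.append(count)
        n_variants += 1
    return rows, cols, vals, (n_variants, n_samples)
```

Only observed, non-zero entries take up space, so missing cells cost nothing regardless of the placeholder used in the VCF. The triplets can be handed to a sparse matrix constructor such as `scipy.sparse.coo_matrix((vals, (rows, cols)), shape=shape)`.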
