Default nan-handling policy is a memory hog #5

ivanov-v-v · 2019-07-26T12:10:01Z

As single-cell datasets are really sparse, it's important to handle missing values in a way that doesn't consume too much memory. Currently, CellSNP labels missing entries with ".:.:.:.:.:." (11 bytes each at best). I would strongly suggest using an empty string instead of that stub. I have been processing the output of CellSNP, and when I manually replaced all occurrences of ".:.:.:.:.:." with an empty string, the file size dropped from 25.6 GB to 2.5 GB. This is dramatic. Not only does this choice of NaN-filling value waste memory, it also makes the file harder to process with some convenient tools in Python/R.
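For anyone who wants to reproduce the replacement, here is a minimal sketch of how it could be done as a streaming pass in Python. It is not part of CellSNP: the function names `compact_line` and `compact_vcf` are made up for illustration, and it assumes a tab-separated, gzip-compressed VCF whose per-sample fields start at the tenth column, as in standard VCF.

```python
import gzip

PLACEHOLDER = ".:.:.:.:.:."  # the missing-entry stub quoted in this issue


def compact_line(line, replacement=""):
    """Replace placeholder sample fields in one tab-separated VCF line.

    Hypothetical helper, not a CellSNP function.
    """
    if line.startswith("#"):
        return line  # header lines are passed through unchanged
    fields = line.rstrip("\n").split("\t")
    # Columns 0-8 are the fixed VCF columns; per-sample fields start at index 9.
    fields[9:] = [replacement if f == PLACEHOLDER else f for f in fields[9:]]
    return "\t".join(fields) + "\n"


def compact_vcf(src_path, dst_path, replacement=""):
    """Stream src_path to dst_path line by line, so the file never sits in memory."""
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        for line in src:
            dst.write(compact_line(line, replacement))
```

Passing `replacement="."` instead of the default empty string gives a variant that keeps a visible marker in every sample column while still dropping most of the padding.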

huangyh09 · 2019-07-27T19:45:58Z

Very good point. The reason we used ".:.:.:.:.:." is to keep the same format (i.e., the same number of tags) even when an entry is missing. I will check whether common R/Python packages for processing VCF files are compatible with "." for missing values. If so, this will indeed save a lot of space.

Alternatively, from v0.1.6, CellSNP supports saving the AD, DP and OTH tags as sparse matrices.
Please use -O OUT_DIR instead of -o OUT_FILE.vcf.gz.
You can also use sparseVCF.py to convert an existing VCF.gz into sparse matrices.
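To illustrate why a sparse representation helps here, the sketch below collects one tag from VCF records as (row, column, value) triplets and skips missing and zero entries. It is not the logic of sparseVCF.py, which is not shown in this thread. The names `sparse_entries` and `to_coo` are hypothetical, and the code assumes the tag position can be read from the FORMAT column (the ninth column in standard VCF) and that the tag holds a single integer per cell.

```python
def sparse_entries(record_line, tag, missing="."):
    """Yield (sample_index, value) for samples with a non-zero value for `tag`.

    Hypothetical helper; assumes `tag` holds one integer per sample.
    """
    fields = record_line.rstrip("\n").split("\t")
    pos = fields[8].split(":").index(tag)  # FORMAT column lists the tag order
    for j, sample in enumerate(fields[9:]):
        parts = sample.split(":")
        if pos >= len(parts) or parts[pos] in (missing, ""):
            continue  # missing entry: nothing is stored
        value = int(parts[pos])
        if value != 0:
            yield j, value


def to_coo(lines, tag):
    """Collect coordinate-format triplets (rows, cols, vals) for one tag."""
    rows, cols, vals = [], [], []
    i = 0  # index of the current variant record, header lines excluded
    for line in lines:
        if line.startswith("#"):
            continue
        for j, value in sparse_entries(line, tag):
            rows.append(i)
            cols.append(j)
            vals.append(value)
        i += 1
    return rows, cols, vals
```

If SciPy is available, the three lists can be handed to `scipy.sparse.coo_matrix((vals, (rows, cols)), shape=(n_variants, n_samples))`. Only observed entries cost memory, regardless of how missing values are spelled in the VCF.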
