Integer overflow in metadata info file #806

Open
gringer opened this issue Oct 17, 2022 · 2 comments


gringer commented Oct 17, 2022

Using salmon alevin v1.9.0, I noticed that the reported total_reads was lower than the deduplicated UMI count when I processed all three libraries (from a NovaSeq run) together:

{
    "total_reads": 284216343,
    "reads_with_N": 165542,
    "noisy_cb_reads": 1240522569,
    "noisy_umi_reads": 6297,
    "used_reads": 3338489231,
    "mapping_rate": 52.32744469106451,
    "reads_in_eqclasses": 2396169786,
    "total_cbs": 48399818,
    "used_cbs": 867051,
    "initial_whitelist": 49593,
    "low_conf_cbs": 1000,
    "num_features": 5,
    "final_num_cbs": 40432,
    "deduplicated_umis": 359865640,
    "mean_umis_per_cell": 8900,
    "mean_genes_per_cell": 2814
}

I suspect this has happened due to an integer overflow: 284216343 + 2^32 = 4579183639, which matches the total count that I get when I add the total reads from each barcoded sample together:

==> salmon_1.9_OG_2022-Oct-13_S1/aux_info/alevin_meta_info.json <==
{
    "total_reads": 1550672340,
    "reads_with_N": 56210,
    "noisy_cb_reads": 465287865,
    "noisy_umi_reads": 3313,
    "used_reads": 1085324952,
    "mapping_rate": 52.507052779441469,
    "reads_in_eqclasses": 814212344,
    "total_cbs": 32341973,
    "used_cbs": 1550909,
    "initial_whitelist": 28000,
    "low_conf_cbs": 991,
    "num_features": 5,
    "final_num_cbs": 18888,
    "deduplicated_umis": 113155025,
    "mean_umis_per_cell": 5990,
    "mean_genes_per_cell": 2035
}

==> salmon_1.9_OG_2022-Oct-13_S2/aux_info/alevin_meta_info.json <==
{
    "total_reads": 1371374162,
    "reads_with_N": 50003,
    "noisy_cb_reads": 389036191,
    "noisy_umi_reads": 3005,
    "used_reads": 982284963,
    "mapping_rate": 54.0580725189425,
    "reads_in_eqclasses": 741338439,
    "total_cbs": 30332499,
    "used_cbs": 1470602,
    "initial_whitelist": 28000,
    "low_conf_cbs": 997,
    "num_features": 5,
    "final_num_cbs": 19134,
    "deduplicated_umis": 127624221,
    "mean_umis_per_cell": 6670,
    "mean_genes_per_cell": 2229
}

==> salmon_1.9_OG_2022-Oct-13_S3/aux_info/alevin_meta_info.json <==
{
    "total_reads": 1657137137,
    "reads_with_N": 59329,
    "noisy_cb_reads": 447471964,
    "noisy_umi_reads": 3629,
    "used_reads": 1209602215,
    "mapping_rate": 55.061293216313938,
    "reads_in_eqclasses": 912441138,
    "total_cbs": 33411349,
    "used_cbs": 1567701,
    "initial_whitelist": 28000,
    "low_conf_cbs": 997,
    "num_features": 5,
    "final_num_cbs": 18395,
    "deduplicated_umis": 125889439,
    "mean_umis_per_cell": 6843,
    "mean_genes_per_cell": 2248
}
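The arithmetic behind this suspicion can be checked directly. The sketch below uses only the total_reads values quoted above; that the counter behaves like an unsigned 32-bit integer is an assumption taken from the report, not something confirmed from the salmon source.

```python
# Per-sample total_reads values copied from the three
# alevin_meta_info.json files quoted above (S1, S2, S3).
per_sample_totals = [1550672340, 1371374162, 1657137137]

# total_reads reported by the combined run.
reported_combined = 284216343

# Python integers do not overflow, so this is the true sum.
true_total = sum(per_sample_totals)

# Assumption: the counter is an unsigned 32-bit integer, which keeps
# only the value modulo 2**32 once it exceeds 4294967295.
wrapped = true_total % 2**32

print(true_total)                    # 4579183639
print(wrapped == reported_combined)  # True
```

The wrapped value equals the reported one exactly, which is consistent with a single wraparound of a 32-bit counter.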

To Reproduce
Steps and data to reproduce the behavior:

  1. Run salmon alevin on more than 2^32 sequenced reads

Specifically, please provide at least the following information:

  • Which version of salmon was used? v1.9.0
  • How was salmon installed (compiled, downloaded executable, through bioconda)? binary download from github
  • Which reference (e.g. transcriptome) was used? Gencode Human v41 + CHM13 v2.0 assembly
  • Which read files were used? BD Rhapsody + NovaSeq
  • Which program options were used?
[cell barcodes were pre-corrected and merged using my own [custom script](https://gitlab.com/gringer/bioinfscripts/-/blob/master/synthSquish.pl)]
salmon alevin -l ISR \
  -1 $(ls demultiplexed/squished_${machineID}*_R1_001.fastq.gz | sort) \
  -2 $(ls demultiplexed/${machineID}*_R2_001.fastq.gz | sort) \
  -i ${indexDir}/${indexName} --expectCells ${expectCellCount} \
  -p 10 -o salmon_1.9_cbc_${projectID}_combined --tgMap ${indexDir}/txp2gene_${targetName}.txt \
  --umi-geometry '1[28-35]' --bc-geometry '1[1-27]' --read-geometry '2[1-end]'

Expected behavior

{
    "total_reads": 4579183639,
    "reads_with_N": 165542,
    "noisy_cb_reads": 1240522569,
    "noisy_umi_reads": 6297,
    "used_reads": 3338489231,
    "mapping_rate": 52.32744469106451,
    "reads_in_eqclasses": 2396169786,
    "total_cbs": 48399818,
    "used_cbs": 867051,
    "initial_whitelist": 49593,
    "low_conf_cbs": 1000,
    "num_features": 5,
    "final_num_cbs": 40432,
    "deduplicated_umis": 359865640,
    "mean_umis_per_cell": 8900,
    "mean_genes_per_cell": 2814
}


rob-p pushed a commit that referenced this issue Oct 27, 2022
Collaborator

rob-p commented Oct 27, 2022

Thanks for this bug report @gringer! I have pushed a change to develop that should address this. Would you need me to produce an executable to test this out?

Author

gringer commented Oct 27, 2022

No, it's fine. I understand the error; it doesn't affect any of my workflow, and I can easily compensate for it.
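One way such a compensation could look is sketched below. In every JSON file quoted in this issue, used_reads + noisy_cb_reads + noisy_umi_reads + reads_with_N adds up to the correct total_reads, so the sketch rebuilds the total from those fields. That relationship is an observation from these files, not a documented guarantee, and the helper name `corrected_total_reads` is hypothetical. It also assumes the component counters themselves have not wrapped.

```python
UINT32_MODULUS = 2**32

# Fields that sum to total_reads in the files quoted in this issue
# (an observed relationship, not a documented invariant).
COMPONENT_KEYS = ("used_reads", "noisy_cb_reads", "noisy_umi_reads", "reads_with_N")


def corrected_total_reads(meta):
    """Return total_reads, undoing a suspected 32-bit wraparound.

    `meta` is the parsed content of an alevin_meta_info.json file.
    """
    expected = sum(meta[key] for key in COMPONENT_KEYS)
    reported = meta["total_reads"]
    # Only correct when the gap is an exact multiple of 2**32;
    # otherwise leave the reported value untouched.
    if expected >= reported and (expected - reported) % UINT32_MODULUS == 0:
        return expected
    return reported


# Relevant fields from the combined run reported above.
combined = {
    "total_reads": 284216343,
    "reads_with_N": 165542,
    "noisy_cb_reads": 1240522569,
    "noisy_umi_reads": 6297,
    "used_reads": 3338489231,
}

# Relevant fields from sample S1, where no overflow occurred.
sample_s1 = {
    "total_reads": 1550672340,
    "reads_with_N": 56210,
    "noisy_cb_reads": 465287865,
    "noisy_umi_reads": 3313,
    "used_reads": 1085324952,
}

print(corrected_total_reads(combined))   # 4579183639
print(corrected_total_reads(sample_s1))  # 1550672340
```

For a real file, `meta` would come from `json.load` on aux_info/alevin_meta_info.json. Requiring the gap to be an exact multiple of 2**32 keeps the function from silently rewriting a total that disagrees with its components for some other reason.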
