Duplication checks #64

Closed
GallVp opened this issue Jun 14, 2023 · 6 comments · Fixed by #89
Labels
done on dev question Further information is requested
Milestone

Comments

@GallVp
Member

GallVp commented Jun 14, 2023

From @CeciliaDeng

BTW, we haven't checked for duplicate sequences in the assembly, have we? I can't remember whether the QC pipeline checks for mitochondrial/plastid/ribosomal RNA contamination. If not, we could list these as 'todo for future release'?

@rosscrowhurst
Collaborator

rosscrowhurst commented Jun 14, 2023

Careful with duplicates at contig level, e.g.

  • exact duplicates: base for base identical and the same length; these should not occur
  • imperfect duplicates: not base for base identical, can be different lengths, might be allelic variation
  • whole genome duplications: need to differentiate these from duplications, since some input contigs might be "similar" but not identical
  • others ....

So just need to be careful about what you mean by "checking duplicate sequences".

Mitochondrial and plastid sequences are only 'contaminants' in some contexts and part of the whole genome in others (the nuclear genome and the organellar genomes combined are the whole genome), so we need to ensure that mitochondrial and chloroplast genomes are not themselves marked as contaminants when they are not.

Plant nuclear genomes have chloroplast gene insertions in them - need to make sure you are not marking these regions as contamination

@GallVp GallVp added question Further information is requested v2 labels Jun 14, 2023
@GallVp GallVp removed the v2 label Apr 16, 2024
@GallVp
Member Author

GallVp commented May 6, 2024

Hi @CeciliaDeng

Does the above comment from Ross answer your question? Is there a tool you have in mind for duplicate detection?

@GallVp GallVp added this to the backlog milestone May 22, 2024
@CeciliaDeng
Collaborator

Hi @GallVp and @rosscrowhurst. We have encountered duplicated sequences before in our NCBI submission, in particular in de novo assemblies from short reads. Yes, the duplicated sequences are usually at contig level, with exactly the same sequence but different SeqIDs. I ran 'ml seqkit; seqkit rmdup -s -o $checkedFasta $inputFasta' to remove such records.
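
For readers without seqkit at hand, the idea of dropping records whose sequence is an exact duplicate can be sketched in plain Python. This is only an illustration, not a description of how seqkit works internally. The names `read_fasta` and `drop_duplicate_sequences` are hypothetical, and comparing case-insensitively on the forward strand only is an assumption made for this sketch.

```python
import hashlib
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_duplicate_sequences(
    records: Iterable[Record],
) -> Tuple[List[Record], List[Tuple[str, str]]]:
    """Keep the first record for each distinct sequence.

    Returns (kept, dropped), where dropped holds pairs of
    (dropped ID, ID of the first record with the same sequence).
    Assumption: case-insensitive, forward-strand-only comparison.
    """
    first_seen = {}  # sequence digest -> ID of first record seen
    kept, dropped = [], []
    for header, seq in records:
        # The sequence ID is the header text up to the first whitespace.
        seq_id = header.split(None, 1)[0] if header else ""
        digest = hashlib.sha256(seq.upper().encode()).hexdigest()
        if digest in first_seen:
            dropped.append((seq_id, first_seen[digest]))
        else:
            first_seen[digest] = seq_id
            kept.append((header, seq))
    return kept, dropped
```

With a real file, `drop_duplicate_sequences(read_fasta(open(path)))` would return the records to keep plus a list of which IDs were dropped and which earlier ID they duplicated, which is useful to log before anything is removed.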

@CeciliaDeng
Collaborator

For genomes we downloaded from the public domain, there are sometimes duplicated SeqIDs, and 'samtools faidx $inputFasta' will complain and exit. In that case we can use 'seqkit rmdup -n -o $outFile $inputFasta' to remove sequences with the same ID. However, sequences with the same SeqID can still differ from each other; in that case I usually append '.1', '.2' and so on to the IDs of the sequences that share an ID.
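
The renaming step could look roughly like the Python sketch below. The function name `make_ids_unique` is hypothetical. Two details are assumptions, since the comment does not specify them: every occurrence of a repeated ID gets a suffix (including the first), and a suffix is skipped if the resulting name already exists in the file.

```python
from collections import Counter
from typing import List


def make_ids_unique(seq_ids: List[str]) -> List[str]:
    """Append '.1', '.2', ... to IDs that occur more than once.

    IDs that are already unique are returned unchanged.
    Assumption: all occurrences of a repeated ID are suffixed,
    in order of appearance, skipping names that are already taken.
    """
    counts = Counter(seq_ids)
    taken = set(seq_ids)  # original IDs plus every new name assigned so far
    next_suffix = {}      # repeated ID -> next suffix number to try
    result = []
    for seq_id in seq_ids:
        if counts[seq_id] == 1:
            result.append(seq_id)
            continue
        n = next_suffix.get(seq_id, 1)
        candidate = f"{seq_id}.{n}"
        while candidate in taken:  # avoid clashing with an existing ID
            n += 1
            candidate = f"{seq_id}.{n}"
        taken.add(candidate)
        next_suffix[seq_id] = n + 1
        result.append(candidate)
    return result
```

When the headers are rewritten, any description following the ID on the header line would need to be carried over unchanged.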

@GallVp
Member Author

GallVp commented May 22, 2024

Thank you @CeciliaDeng

This is very useful information. I will add the following to fasta validation:

  1. All sequence ids must be unique
  2. All sequences must be unique. A sequence is defined as the entire sequence and not a part of the sequence. The match must have 100% identity and coverage.
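
The two rules can be sketched as a single pass over the records. This is a minimal illustration, not the code that was later merged. The function name `check_uniqueness` is hypothetical, and treating upper and lower case (soft-masked) bases as equal is an assumption. "100% identity and coverage" is implemented as equality of the whole sequence string, so a contig that is merely contained in a longer one is not reported.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def check_uniqueness(
    records: Iterable[Tuple[str, str]],
) -> Tuple[List[str], List[List[str]]]:
    """Check (seq_id, sequence) records against both uniqueness rules.

    Returns (duplicate_ids, duplicate_sequence_groups):
      - duplicate_ids: IDs that occur more than once (rule 1)
      - duplicate_sequence_groups: lists of IDs whose entire
        sequences are identical (rule 2)
    Assumption: sequences are compared case-insensitively.
    """
    id_counts = defaultdict(int)
    ids_by_sequence = defaultdict(list)
    for seq_id, seq in records:
        id_counts[seq_id] += 1
        # Keyed on the full sequence for clarity; a digest of the
        # sequence would use less memory on large assemblies.
        ids_by_sequence[seq.upper()].append(seq_id)
    duplicate_ids = [i for i, n in id_counts.items() if n > 1]
    duplicate_groups = [ids for ids in ids_by_sequence.values() if len(ids) > 1]
    return duplicate_ids, duplicate_groups
```

A file passes validation under these rules when both returned lists are empty.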

@GallVp GallVp modified the milestones: backlog, 2.0.0 May 22, 2024
@GallVp GallVp changed the title from "More contamination checks" to "Duplication checks" May 22, 2024
@GallVp
Member Author

GallVp commented May 27, 2024

We are using py_fasta_validator to validate fasta files. It does detect sequence ID duplication. Please see:

https://github.com/linsalrob/py_fasta_validator/blob/32d1d2a49da550df41d44bc61be4341cdf104ae4/PyFastaValidator/validate.py#L28

@GallVp GallVp mentioned this issue May 30, 2024
10 tasks
@GallVp GallVp closed this as completed in #89 Jun 5, 2024