RNA Virus Discovery Pipeline

This Bash script is a bioinformatics pipeline I built during my postdoc at Universitat Autònoma de Barcelona to discover RNA viruses in raw sequencing data. It covers read cleaning, interleaving, dereplication, decontamination, taxonomic classification, assembly, viral sequence identification, verification, and annotation.

Usage

For my purposes, I created separate conda environments for the individual components (see the comments in the main script). How you organize them is up to you, but every tool must be installed before the pipeline is run.
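Because every tool has to be installed beforehand, a quick pre-flight check can catch a missing one early. The following is a minimal sketch and not part of the original script: the function name `check_tools` is hypothetical, and the executable names in the example call are assumptions about how the tools appear on your `PATH`. With one conda environment per tool, the check would have to run inside each activated environment.

```bash
#!/usr/bin/env bash
# Report which of the given executables are missing from PATH.
# Prints one line per missing tool to stderr and returns 1 if any is missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf 'missing: %s\n' "$tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call (executable names are assumptions; adjust to your installation):
# check_tools fastp kraken2 spades.py diamond checkv prokka || exit 1
```

Running the check at the top of a wrapper script makes the pipeline fail immediately instead of partway through a long run.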

To run the entire pipeline, execute the following command:

bash -i -v VIRUS_DISCOVERY_PIPELINE.sh

The individual step scripts are also included separately for debugging.

Pipeline Steps

Read Cleaning and Filtering
- Input: Raw paired-end FASTQ files (R1 and R2).
- Tool: fastp
- Output: Trimmed and filtered FASTQ files.
Interleave Reads
- Input: Trimmed R1 and R2 FASTQ files.
- Tool: BBTools
- Output: Interleaved FASTQ files.
Dereplication of Reads
- Input: Interleaved FASTQ files.
- Tool: CD-HIT
- Output: Dereplicated FASTQ files.
Decontamination of Reads
- Input: Dereplicated FASTQ files.
- Tool: BBTools
- Output: Decontaminated FASTQ files.
Kraken Taxonomic Classification
- Input: Decontaminated FASTQ files.
- Tool: Kraken2
- Output: Kraken classification results.
SPAdes Assembly (--rnaviral)
- Input: Decontaminated FASTQ files.
- Tool: SPAdes
- Output: Assembled contigs.
Diamond BlastX Search
- Input: Assembled contigs.
- Tool: Diamond
- Output: Diamond BlastX results and viral contigs.
VirSorter2 Classification
- Input: Assembled contigs.
- Tool: VirSorter2
- Output: VirSorter2 classification results and viral contigs.
DeepVirFinder Classification
- Input: Assembled contigs.
- Tool: DeepVirFinder
- Output: DeepVirFinder classification results and viral contigs.
Combine and Extract Significant Contigs
- Input: Viral contigs from Diamond, VirSorter2, and DeepVirFinder.
- Output: Combined viral contigs.
CheckV Validation
- Input: Combined viral contigs.
- Tool: CheckV
- Output: CheckV validation results.
vRhyme Viral Identification
- Input: Combined viral contigs.
- Tool: vRhyme
- Output: vRhyme viral identification results.
Prokka Annotation
- Input: Combined viral contigs.
- Tool: Prokka
- Output: Annotated viral contigs.
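
The "Combine and Extract Significant Contigs" step above can be sketched with standard Unix tools. This is an illustration under assumptions, not the script's actual code: it assumes the hits from each classifier have already been reduced to a plain text file of contig IDs (one per line), and the function names `merge_ids` and `extract_contigs` as well as the file names in the usage comments are hypothetical.

```bash
#!/usr/bin/env bash
# Merge contig-ID lists (one ID per line) into one sorted, de-duplicated list.
merge_ids() {
    cat "$@" | sort -u
}

# Print the FASTA records whose ID (first word of the header, without ">")
# appears in the ID list.
# Usage: extract_contigs ids.txt contigs.fasta
extract_contigs() {
    awk '
        # First file: remember every wanted ID.
        NR == FNR { if ($1 != "") keep[$1] = 1; next }
        # Header line: decide whether this record is wanted.
        /^>/ { id = substr($1, 2); printing = (id in keep) }
        # Emit header and sequence lines of wanted records.
        printing
    ' "$1" "$2"
}

# Example (hypothetical file names):
# merge_ids diamond_ids.txt virsorter2_ids.txt dvf_ids.txt > combined_ids.txt
# extract_contigs combined_ids.txt contigs.fasta > combined_viral_contigs.fasta
```

Taking the union of the three ID lists keeps any contig flagged by at least one tool; a stricter rule (for example, requiring two tools to agree) would need a different merge step.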

Note

Adjust file paths, database locations, and parameters as needed.
For Prokka, build your own viral protein database.
Build your own contaminants database. In my case, I used bat, human, and bacterial sequences because I worked with bat nasopharyngeal samples.
Some steps are commented out in the script and need to be adapted to your requirements before use.
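
As one way to assemble the custom contaminants database mentioned above, the host and other contaminant genomes can be concatenated into a single FASTA file. This is only a minimal sketch: the function name `build_contaminant_ref` and all file names are hypothetical, and which BBTools program consumes the resulting file, and with which parameters, depends on the main script.

```bash
#!/usr/bin/env bash
# Concatenate plain or gzip-compressed FASTA files into one reference file.
# Usage: build_contaminant_ref output.fasta input1.fasta [input2.fasta.gz ...]
build_contaminant_ref() {
    local out=$1 f
    shift
    : > "$out"                      # create or truncate the output file
    for f in "$@"; do
        case "$f" in
            *.gz) gzip -dc "$f" >> "$out" ;;
            *)    cat "$f" >> "$out" ;;
        esac
        # If the input lacked a final newline, add one so the next
        # header starts on its own line.
        if [ -n "$(tail -c 1 "$out")" ]; then
            echo >> "$out"
        fi
    done
}

# Example (hypothetical file names):
# build_contaminant_ref contaminants.fasta bat_genome.fasta human_genome.fasta.gz bacteria.fasta
```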

Feel free to customize and enhance the pipeline according to your specific needs.
