moonshot : Metagenome assembly pipeline

I-Snakemake Pipeline for Genome Assembly, Annotation, and Pangenome Analysis

This pipeline automates the process of read filtering, genome assembly, annotation, and pangenome analysis. It is designed to process multiple samples and produce reproducible results in a structured directory.

Note: Due to the computational intensity of genome assembly, it is recommended to run this pipeline on a high-performance computing cluster (HPC). Running the pipeline on a local machine might require substantial resources (RAM, CPU threads), particularly when handling large datasets.

Requirements

Python 3.x
Snakemake
Bowtie2
Samtools
SPAdes
Prokka
Panaroo
IQ-TREE

Before running the pipeline on a cluster, make sure to load the necessary modules. You can use the following commands to load the required software on your environment:

module load Python/3.8.1 snakemake bowtie2 samtools SPAdes panaroo IQ-TREE

Note:Depending on the modules already installed in the environment, the following modules must be installed or loaded using the appropriate module load commands:

module load graalvm minced hmmer bedtools aragorn blast+ infernal prodigal ncbitools barrnap signalp cd-hit

Directory Structure

For each sample listed in list_samples.txt, the pipeline will create a structured directory with subdirectories for the outputs of each tool.

results/
  ├── sample1/
  │   ├── bowtie2/
  │   ├── spades/
  │   └── prokka/
  ├── sample2/
  │   ├── bowtie2/
  │   ├── spades/
  │   └── prokka/
  ...
panaroo/
iqtree/

Input Files

list_samples.txt: A list of sample names, one per line.
sample_correspondance.txt: A tab-delimited file mapping sample names from list_samples.txt to their corresponding FASTA filenames.

Example sample_correspondance.txt:

IMVM_0001	A4138_0001
IMVM_0002	A4138_0002

CustomDB : This database will be used to filter the reads, ensuring that only reads mapping to this database are included in the metagenome-assembled genomes (MAGs). This database typically consists of reference genomes, and can be expanded to include environmental or clinical strains. The FASTA files should be downloaded, and the database must be constructed using Bowtie2 with the following command:

bowtie2-build <reference_fasta> <CustomDB_name>

Exemple: Ecoli_customDB_Accession#: E. coli genome accesssion numbers used to build a customDB used to assemble E. coli MAGs from metagenomic data of vaginal samples. This database includes both reference and clinical strains.

Launch the Snakemake Pipeline

sbatch -p <partition> -c <number_of_cores> snakemake -p -j <number_of_jobs> -s <snakemake_file>

<partition>: The partition or resource group you want to use. <number_of_cores>: The number of cores you wish to allocate. <number_of_jobs>: The maximum number of jobs to run in parallel. <snakemake_file>: The name of the Snakemake file (usually Snakefile).

II-Quality Assessment of the generated MAGs

At the end of the assembly process, general quality metrics of the generated MAGs can be calculated using two scripts: run_stat.py and assembly_stat.py.

assembly_stat.py

This script computes statistics for a given contigs FASTA file. It parses the contigs, calculates lengths, and determines the N50 and N90, which are helpful for assessing the quality of the assembled genomes.

run_stat.py

This script iterates through sample folders containing contigs.fasta files generated during the assembly process. It invokes assembly_stat.py for each file, collects various assembly statistics, and writes the results to a CSV file. Key statistics include:

Total length of contigs greater than 0 bp and 500 bp
Number of contigs greater than 0 bp and 500 bp
Largest and smallest contig sizes
Average contig size
N50 and N90 values

Acknowledgements and contributions

We thank Thi Ngoc Anh Vu, who initially developed this pipeline under the supervision of Sean P. Kennedy. The work on assembling E. coli MAGs from vaginal samples of pregnant women was carried out by Nassim Boutouchent.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
Ecoli_customDB_Accession#.txt		Ecoli_customDB_Accession#.txt
LICENSE		LICENSE
README.md		README.md
assembly_stat.py		assembly_stat.py
run_stat.py		run_stat.py
snakefile.py		snakefile.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

moonshot : Metagenome assembly pipeline

I-Snakemake Pipeline for Genome Assembly, Annotation, and Pangenome Analysis

Requirements

Directory Structure

Input Files

Launch the Snakemake Pipeline

II-Quality Assessment of the generated MAGs

assembly_stat.py

run_stat.py

Acknowledgements and contributions

About

Releases

Packages

Languages

License

motleystate/moonshot

Folders and files

Latest commit

History

Repository files navigation

moonshot : Metagenome assembly pipeline

I-Snakemake Pipeline for Genome Assembly, Annotation, and Pangenome Analysis

Requirements

Directory Structure

Input Files

Launch the Snakemake Pipeline

II-Quality Assessment of the generated MAGs

assembly_stat.py

run_stat.py

Acknowledgements and contributions

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages