PALADIN

Protein ALignment And Detection INterface

PALADIN is a protein sequence alignment tool designed for the accurate functional characterization of metagenomes.

PALADIN is based on BWA and aligns sequences by read mapping with the Burrows-Wheeler transform (BWT). Unlike BWA, however, PALADIN offers the novel approach of aligning in protein space. During the index phase, it processes the reference genome's nucleotide sequences together with a GTF/GFF annotation containing CDS entries, first converting these transcripts into the corresponding protein sequences, then creating the BWT and suffix array from these proteins. The translation step is skipped when a protein reference file (e.g., UniProt) is provided for mapping. During the alignment phase, it attempts to find ORFs in the read sequences, translates these to protein sequences, and aligns them to the reference protein sequences.
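To make the alignment-phase idea concrete, the following is a minimal Python sketch of six-frame translation with a naive ORF search. It is an illustration only and is not PALADIN's actual ORF-detection code (which is implemented in C and may use different rules). The function names, the `min_len` threshold, and the choice to treat any stop-free stretch as a candidate ORF are all assumptions made for this example.

```python
from itertools import product

# Standard genetic code, laid out in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def candidate_orfs(read, min_len=10):
    """Yield (frame, peptide) for stop-free stretches in all six frames.

    Hypothetical helper: min_len and the 'stop-free stretch' rule are
    assumptions for illustration, not PALADIN parameters.
    """
    read = read.upper()
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for offset in range(3):
            protein = translate(seq[offset:])
            for peptide in protein.split("*"):
                if len(peptide) >= min_len:
                    yield f"{strand}{offset + 1}", peptide


if __name__ == "__main__":
    for frame, peptide in candidate_orfs("ATGGCCAAGTAA", min_len=3):
        print(frame, peptide)
```

Each peptide produced this way is the kind of sequence that would then be aligned against a protein reference, rather than aligning the nucleotide read itself.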

PALADIN currently supports only single-end reads (or paired reads merged with FLASH, PEAR, or abyss-mergepairs) and BWA-MEM based alignment. It reuses many BWA parameters and is therefore compatible with many of BWA's command line arguments.

PALADIN may output a standard SAM file, or a text file containing a UniProt-generated functional profile. This text file may be used for all downstream characterizations.

INSTALLATION

Dependencies

On a fresh install of Ubuntu, you will need the packages build-essential, libcurl4-openssl-dev, git, make, gcc, and zlib1g-dev. On Ubuntu 14.04 they can be installed with:

sudo apt-get install build-essential libcurl4-openssl-dev git make gcc zlib1g-dev

PALADIN compiles by default on OSX 10.10.x.

git clone https://github.com/twestbrookunh/paladin.git
cd paladin/
make
PATH=$PATH:$(pwd)

SAMPLE COMMANDS

Download and prepare UniProt Swiss-Prot index files.

paladin prepare -r1

Download and prepare UniProt UniRef90 index files.

paladin prepare -r2

Index a UniProt (or other protein) FASTA file, if not using the automated prepare command.

paladin index -r3 uniprot_sprot.fasta.gz

Align a set of reads using 4 threads. Send the full UniProt report to paladin_uniprot.tsv.

paladin align -t 4 -o paladin index input.fastq.gz

Align a set of reads using 4 threads. Produce a BAM file.

paladin align -t 4 index input.fastq.gz | samtools view -Sb - > test.bam

Align a set of reads, preferring higher quality mappings over number of proteins detected.

paladin align -T 20 -o paladin index input.fastq.gz

Align a set of reads, report secondary alignments, and generate a UniProt report covering both primary and secondary alignments.

paladin align -a -o paladin index input.fastq.gz

If you're interested in trying this out on a smallish test file, try downloading this one, which is from a human lung metagenome study: http://www.ebi.ac.uk/ena/data/view/PRJNA71831

#install PALADIN as per above

curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR117/002/SRR1177122/SRR1177122.fastq.gz
paladin prepare -r1 #unless already done
paladin align -t 4 -o lungstudy uniprot_sprot.fasta.gz SRR1177122.fastq.gz

#look at report file, SAM, etc.
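
As one way of looking at the SAM output, the sketch below counts mapped reads per reference protein, optionally keeping only reads at or above a mapping-quality cutoff. It relies only on the standard SAM column layout (FLAG in column 2, RNAME in column 3, MAPQ in column 5). The function name `count_hits` and the file name `lungstudy.sam` in the usage note are assumptions for illustration, not something PALADIN defines.

```python
from collections import Counter


def count_hits(sam_lines, min_mapq=0):
    """Count mapped reads per reference name from an iterable of SAM lines.

    Hypothetical helper for illustration; min_mapq is an assumed cutoff.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag, rname, mapq = int(fields[1]), fields[2], int(fields[4])
        if flag & 0x4 or rname == "*":
            continue  # read is unmapped
        if mapq >= min_mapq:
            counts[rname] += 1
    return counts
```

Assuming the alignments were written to a file named lungstudy.sam, `count_hits(open("lungstudy.sam"), min_mapq=20).most_common(10)` would list the ten reference proteins with the most confidently mapped reads.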

OUTPUT

A SAM/BAM file that can be used for any downstream analyses.
A tab-delimited UniProt report file.

#FORMAT

Count	Abundance	Quality (Avg)	Quality (Max)	UniProtKB	ID	Organism	Protein Names	Genes	Pathway	Features	Gene Ontology	Reviewed	Existence	Comments	Cross Reference (KEGG)	Cross Reference (GeneID)	Cross Reference (PATRIC)	Cross Reference (EnsemblBacteria)

Count: The number of reads mapping to that UniProt entry
Abundance: The percentage of reads mapping to that UniProt entry
Quality (Avg): The average mapping quality for reads mapped to that UniProt entry (Phred scale, max 60)
Quality (Max): The maximum mapping quality for reads mapped to that UniProt entry (Phred scale, max 60)
UniProtKB: The ID containing the Gene short-code and species of origin
ID: The UniProt code
Organism: The organism from which the UniProt ID is derived. Use this column to generate a taxonomic profile of your sample
Protein Names
Genes
Pathway
Features
Gene Ontology
Reviewed
Existence
Comments
Cross Reference (KEGG): Corresponding entry in KEGG database (http://www.genome.jp/kegg/)
Cross Reference (GeneID): Corresponding entry in NCBI gene database (http://www.ncbi.nlm.nih.gov/gene)
Cross Reference (PATRIC): Corresponding entry in PATRIC database (http://www.patricbrc.org)
Cross Reference (EnsemblBacteria): Corresponding entry in Ensembl Bacteria database (http://bacteria.ensembl.org)
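
As an illustration of downstream use, the following Python sketch sums the Count column per Organism to give a rough taxonomic profile. It assumes the report's first line is the tab-separated header shown above, that Count holds integers, and that the column names match exactly. The function name `organism_profile` is hypothetical.

```python
import csv
from collections import Counter


def organism_profile(report_file):
    """Sum read counts per organism from an open UniProt report file.

    Assumes a tab-delimited file whose header row includes the
    'Count' and 'Organism' columns described above.
    """
    totals = Counter()
    for row in csv.DictReader(report_file, delimiter="\t"):
        totals[row["Organism"]] += int(row["Count"])
    return totals
```

For a report written with `-o paladin`, this could be called as `organism_profile(open("paladin_uniprot.tsv", newline=""))`, and `.most_common()` on the result ranks organisms by the number of reads assigned to them.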
