Skip to content
/ SCOTCH Public

Single-Cell Omics for Transcriptome CHaracterization (SCOTCH): isoform-level characterization of gene expression through long-read single-cell RNA sequencing

License

Notifications You must be signed in to change notification settings

WGLab/SCOTCH

Repository files navigation

SCOTCH

Single-Cell Omics for Transcriptome CHaracterization (SCOTCH): isoform-level characterization of gene expression through long-read single-cell RNA sequencing. See our pre-print at here. All codes and analysis results in the manuscript can be found at here

Background

Recent development involving long-read single-cell transcriptome sequencing (lr-scRNA-Seq) represents a significant leap forward in single-cell genomics. With the recent introduction of R10 flowcells by Oxford Nanopore, computational methods should now shift focus on harnessing the unique benefits of long reads to analyze transcriptome complexity. In this context, we introduce a comprehensive suite of computational methods named Single-Cell Omics for Transcriptome CHaracterization (SCOTCH). Our method is compatible with the single-cell library preparation platform from 10X Genomics, Pacbio Biosciences, and Parse Biosciences, facilitating the analysis of special cell populations, such as neurons, hepatocytes and developing cardiomyocytes.

SCOTCH offers both a preprocessing and a statistical pipeline. The preprocessing pipeline accepts BAM files with tagged barcodes generated by vendor-supplied pipelines (e.g., wf-single-cell) and aligns reads to both known and novel isoforms. SCOTCH can leverage existing high-quality gene annotations, update these annotations using the data to enhance novel isoform identification, or operate entirely annotation-free. SCOTCH outputs count matrices at both the gene and transcript levels. The statistical pipeline supports the analysis of differential transcript usage (DTU), which is determined by the relative abundances of transcripts within the same gene.

Part1: Preprocessing Pipeline

Installation

To avoid package version conflicts, we strongly recommand to use conda to set up the environment. If you don't have conda installed, you can run codes below in linux to install.

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh 
bash Miniconda3-latest-Linux-x86_64.sh

After installing conda, SCOTCH sources can be downloaded.

git clone https://github.com/WGLab/SCOTCH.git
cd SCOTCH
conda env create --name SCOTCH --file scotch.yml
conda activate SCOTCH

Run SCOTCH preprocessing pipeline

main_preprocessing.py is the primary function for running the SCOTCH preprocessing pipeline. This pipeline consists of four steps: (1) generating gene annotation files, (2) generating the compatibility matrix, (3) generating the count matrix, and (4) summarizing novel gene annotations. We will cover more details below.

Input and Output

In general, SCOTCH accepts BAM file(s) tagged by vendor-supplied pipelines for its preprocessing pipeline (For ONT data, please see here for more details), and output a series of information including read-isoform mappings, gene count matrix, isoform count matrix, and updated gene annotations. The --bam argument is used to assign input file(s), and --target is used to assign the output folder(s).

The SCOTCH pipeline allows users to process one sample at a time or multiple samples simultaneously depending on study design. When preprocessing multiple samples together, SCOTCH generates a unified gene/isoform annotation based on all BAM files. This approach is particularly beneficial in studies involving multiple samples, as it ensures consistent identification and consolidation of novel isoforms across different samples. Some usage examples are below:

Example1: --bam accepts a single bam file

--bam sample1.bam --target path/to/sample1/output

Example2: --bam accepts multiple bam files, each from one sample

--bam sample1.bam sample2.bam --target path/to/sample1/output path/to/sample2/output

Example3: --bam accepts a folder of bam files from one sample

--bam path/to/sample1/bamfiles/folder --target path/to/sample1/output

Example4: --bam accepts multiple bam folders, each from one sample

--bam path/to/sample1/bamfiles/folder path/to/sample2/bamfiles/folder --target path/to/sample1/output path/to/sample2/output

Below is the directory structure of output under each target folder you can expect:

  • auxiliary/: This folder contains information of read-isoform mappings, cell barcodes, UMI, and exon coordinates of Isoforms.

  • bam/: Information of BAM files preprocessed, including read names, barcodes, read length etc.

  • compatible_matrix/: This directory holds the compatibility matrices of read - isoform alignments.

  • reference/: This folder includes the reference files used or generated by SCOTCH.

  • count_matrix/: This folder holds the count matrices on both gene and isoform levels generated by SCOTCH.

Step1: prepare annotation file

In this step, SCOTCH will generate annotation files for the reference genome and tagged BAM files. Set --tast annotation to run this step.

SCOTCH offers three modes for generating gene annotations:

  1. Annotation-Only Mode: SCOTCH can rely entirely on existing gene annotations. This mode allows for the discovery of novel isoforms defined by combinations of known exons. This mode will fail to identify novel isoforms involving unknown exons, such as intron retention. Set --reference as path to gene annotation .gtf file. To save time, users can also set --reference_pkl as the path to SCOTCH generated annotation based on given gtf file. SCOTCH has pre-computated this file based on this (human hg38) provided by 10X genome. In addition, set --update_gtf_off.
  2. Semi-Annotation Mode: SCOTCH can use BAM files from one or multiple samples to update and refine existing gene annotations. This mode allows for the discovery of de novo (sub)exons with more types of novel isoforms than annotation-only mode. Set --reference as path to gene annotation .gtf file. To save time, users can also set --reference_pkl as the path to SCOTCH generated annotation based on given gtf file. SCOTCH has pre-computated this file based on this (human hg38) provided by 10X genome. In addition, set --update_gtf.
  3. Annotation-Free Mode: SCOTCH can generate gene and isoform annotations based solely on BAM files, allowing for the discovery of novel genes and isoforms. Set --reference None and --reference_pkl None.

arguments

  • --bam: path(s) to the bam file(s)/folder(s)
  • --workers: number of threads for parallel computing.
  • --coverage_threshold_exon: coverage threshold to support exon discovery, percentage to the maximum coverage, larger values will be more conservative, default is 0.02.
  • --coverage_threshold_splicing: threshold to support splicing discovery, percentage to the maximum splicing junctions, larger values will be more conservative, default is 0.02.
  • --z_score_threshold: z score threshold to discovery sharp changes of read coverage, larger values will be more conservative, default is 10.

Below is an example of generating annotation in the Semi-Annotation Mode for two samples simultanuously. See example/annotation.sh for an implementation in slurm.

python3 src/main_preprocessing.py \
--task annotation \
--target path/to/output/folder/of/sample1 path/to/output/folder/of/sample2 \
--bam path/to/bam/file/or/bamfolder/sample1 path/to/bam/file/or/bamfolder/sample2 \
--reference path/to/genes.gtf \
--workers 30

Step2: generate compatible matrix

In this step, SCOTCH aligns reads to existing gene isoforms and identifies novel isoforms based on the annotation files generated in step 1. Set --tast 'compatible matrix' to run this step. A compatibility matrix is generated for each gene, where each row represents a read and each column represents a gene isoform. Detailed information on read-isoform mappings can be found in the auxiliary folder. To speed up this step, job arrays can be submitted in SLURM to process genes in parallel. An example implementation for 100 job arrays is provided in example/compatible.sh.

  • --target: the same with step1
  • --bam the same with step1
  • --match exon percentage threshold to call a read-exon mapping/unmapping. For example, setting 0.2 means reads covers >80% of the exon length as mapped, and reads covers <20% of the exon length as unmapped.
  • --total_jobs: the number of batches to split genes and run in parallel
  • --job_index: the batch/job index current task to run
  • --cover_existing: overwrite existing results, default. --cover_existing_false: do not overwrite existing results, only generate compatible matrix for genes do not have compatible matrix results.
python3 src/main_preprocessing.py \
--task 'compatible matrix' \
--target path/to/output/folder/of/sample1 path/to/output/folder/of/sample2 \
--bam path/to/bam/file/or/bamfolder/sample1 path/to/bam/file/or/bamfolder/sample2 \
--match 0.2 

Step3: generate count matrix

In this step, SCOTCH will generate gene- and isoform-level copunt matrix. Set --tast 'count matrix' to run this step. See example/count.sh for an implementation in SLURM.

  • --target: the same with step1
  • '--novel_read_n' filter out novel isoform is the number of reads mapped for the sample less than novel_read_n, and reads mapped to this novel isoform will be treated as uncategorized. Default is 20.
  • --workers: number of threads for parallel computing.
python3 src/main_preprocessing.py \
--task 'count matrix' \
--target path/to/output/folder/of/sample1 path/to/output/folder/of/sample2 \
--workers 20

Step4: update gene annotations (optional)

Finally, if you ran annotation-free or semi-annotation mode, SCOTCH will generate a new gene annotation file. Set --tast summary to run this step. See example/summary.sh for an implementation in SLURM.

  • --reference: set to None if using annotation free mode; set to known annotation gtf file path if using semi-annotation mode.
python3 src/main_preprocessing.py \
--task 'summary' \
--target path/to/output/folder/of/sample1 path/to/output/folder/of/sample2
--reference path/to/reference/genes.gtf

Part2: Statistical Pipeline

Installation

In R, run below codes to install SCOTCH statistical pipeline.

if (!requireNamespace("devtools", quietly = TRUE))
install.packages("devtools")
library("devtools")
install_github("WGLab/SCOTCH")

DTU analysis

Below is sample codes for DTU analysis. Example data can be found in /data folder.

library(SCOTCH)

#----read gene-level count matrix-----#
sample8_CD4_gene=as.matrix(read.csv('gene_count_1.csv',row.names = 'X'))
sample8_CD8_gene=as.matrix(read.csv('gene_count_2.csv',row.names = 'X'))

#----read transcript-level count matrix-----#
sample8_CD4_transcript=as.matrix(read.csv("transcript_count_1.csv",row.names = 'X'))
gene_transcript_CD4_df = data.frame(genes=str_remove(colnames(sample8_CD4_transcript),"\\.(ENST|novel|uncategorized).+"),
                                    transcripts=colnames(sample8_CD4_transcript))

sample8_CD8_transcript=as.matrix(read.csv("transcript_count_2.csv",row.names = 'X'))
gene_transcript_CD8_df = data.frame(genes=str_remove(colnames(sample8_CD8_transcript),"\\.(ENST|novel|uncategorized).+"),
                                    transcripts=colnames(sample8_CD8_transcript))

#----gene-level analysis-----#
df_gene = scotch_gene(sample8_CD4_gene, sample8_CD8_gene), epsilon=0.01,ncores=10)%>%
  filter(pct1>=0.01|pct2>=0.01)

#----transcript-level analysis-----#
df_transcript = scotch_transcript(gene_transcript_CD4_df,gene_transcript_CD8_df, 
                                  sample8_CD4_transcript, sample8_CD8_transcript, ncores=10)
df_scotch = df_gene%>%left_join(df_transcript,by=c('genes'='gene'))

SCOTCH will output the following statistics in results:

genes: gene name
pct1: percent of cells expression the gene in cell population 1
pct2: percent of cells expression the gene in cell population 2
logFC: fold change in log scale
p_gene: p value on the gene level
pct_diff: difference between pct1 and pct2
p_gene_adj: adjusted p value of differential gene expression
isoforms: isoform name
alpha1: estimated parameter of the dirichlet distribution for cell population 1
alpha2: estimated parameter of the dirichlet distribution for cell population 2
TU1: relative transcript usage of cell population 1
TU2: relative transcript usage of cell population 2
TU_diff: difference of TU1 and TU2
TU_var1: variance of TU1
TU_var2: variance of TU2
dispersion1: dispersion parameter for cell population 1
dispersion2: dispersion parameter for cell population 2
isoform_switch: whether there is isoform switching event
isoform_switch_ES: effect size of isoform switching event
p_DTU_gene: p value for differential transcript usage analysis for the whole gene
p_transcript: p value for differential transcript usage analysis for the specific transcript
p_transcript_adj: adjusted p value for differential transcript usage analysis for the specific transcript
p_DTU_gene_adj: adjusted p value for differential transcript usage analysis for the whole gene

About

Single-Cell Omics for Transcriptome CHaracterization (SCOTCH): isoform-level characterization of gene expression through long-read single-cell RNA sequencing

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published