
Falco Reference

Overview

Falco is a framework, written in the Python programming language, that helps a user perform feature quantification on genomic data. Feature quantification is most commonly associated with single-cell RNA-seq data; however, Falco can also perform feature quantification on ChIP-Seq data.

Falco automates the process of feature quantification by creating a cluster of computers on the public cloud service Amazon Web Services (AWS). This cluster utilises the AWS Elastic MapReduce (EMR) service, which enables the automatic configuration of cluster tools such as Hadoop map/reduce and Apache Spark. These tools are used to perform the feature quantification analysis in parallel, using the combined power of the cluster.

The default analysis utilises the STAR alignment software, together with featureCount and Picard Tools, to count genomic features in the analysed data. However, Falco provides the option to use HISAT2 for alignment and/or HTSeq for quantification.

Input data, software, output data, and reports are stored on the AWS Simple Storage Service (S3). A local resource is required to configure the various stages of the pipeline, including splitting and interleaving the input data, pre-processing the input data, and setting up the analysis job.

Installation

See the installation document

Configuration

Genome folder structure

This section provides background for the immediately following sections on the Reference Genome and the Alignment Index.

Falco is designed so that changing the reference genome, aligner software, or feature quantification tool is a simple process. Part of the way this is achieved is by maintaining an organised directory structure in which the reference files are stored. The listing below shows an example of the directory structure required by the Falco framework.

Once the user has created the individual reference files required for their analysis, these should be copied to AWS S3 in the same structure as shown. For example, a user may have human data to analyse and may choose to use HISAT2 as the aligner. This would require the user to obtain the necessary hg38 (or other) reference files, build the HISAT2 index, and then upload these reference files to their AWS S3 bucket. Assuming the user has used a directory structure as shown below, they could copy it to AWS S3 with a command like:

aws s3 sync genomes s3://your_bucket/

genomes
|── hg38
|   |── genome_ref
|   |   |── GRCh38.gencode.v24.primary_assembly.annotation.gtf.gz
|   |   └── refFlat.txt.gz
|   |── star_index
|   |   |── Genome
|   |   |── SA
|   |   └── ...
|   └── hisat_index
|       |── hisat2.exon
|       |── hisat2.index.1.ht2
|       └── ...
└── mm10
    |── genome_ref
    |   |── gencode.vM9.primary_assembly.annotation.gtf.gz
    |   └── refFlat.txt.gz
    |── star_index
    |   |── Genome
    |   |── SA
    |   └── ...
    └── hisat_index
        |── hisat2.exon
        |── hisat2.index.1.ht2
        └── ...

Reference Genome

As part of the bootstrapping (configuration) phase of Falco, reference genome files are copied from the user's AWS S3 bucket to the AWS EMR cluster. The reference genome files are used in the analysis phase for counting features. Note that once the reference FASTA (.fa) file has been used to create the aligner index, there is no requirement to include this .fa file in the reference genome files.

The user is required to copy the Gene Transfer Format (.gtf) file and the refFlat file - a file that associates gene names with gene prediction information. These files may be stored in the .gz format; any such files will be gunzipped by Falco.

The .fa and .gtf files can be obtained from sources such as GENCODE or UCSC. In any case, both the .fa and .gtf files should be taken from the same source.

At the time of writing, the refFlat file for the human genome hg38 is available from UCSC.

Falco works with genomes other than the human genome. For example, mouse genome references are available from GENCODE.
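
As a rough sketch of how these reference files might be obtained and arranged locally, the commands below download the hg38 annotation and refFlat files and place them in the genome_ref folder. The download URLs and file names are assumptions based on the GENCODE and UCSC sites at the time of writing and should be verified before use:

# illustrative only - check the current GENCODE and UCSC download locations
mkdir -p genomes/hg38/genome_ref

# annotation (.gtf) and genome (.fa) from GENCODE (release 24 shown as an example)
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/gencode.v24.primary_assembly.annotation.gtf.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/GRCh38.primary_assembly.genome.fa.gz

# refFlat file from UCSC
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refFlat.txt.gz

# the .gtf and refFlat files go into genome_ref (they may remain gzipped);
# the .fa file is only needed locally to build the aligner index
cp gencode.v24.primary_assembly.annotation.gtf.gz refFlat.txt.gz genomes/hg38/genome_ref/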

Alignment index

STAR index

Falco utilises the STAR software package to align transcripts to a genome. The user needs to supply the reference and index files necessary for STAR to operate; these STAR files are required to be uploaded to an AWS S3 bucket owned by the user. Instructions on how to create a STAR index are included in the STAR documentation. Falco requires that the STAR index is less than 20 GB in size. A suggested command to create the necessary STAR files is (this example uses a human genome):

STAR --runMode genomeGenerate --genomeDir hg38_sparse_index/ --genomeFastaFiles GRCh38.primary_assembly.genome.fa --genomeSAsparseD 2 --sjdbGTFfile GRCh38.gencode.v24.primary_assembly.annotation.gtf

The hg38_sparse_index/ directory should look something like:

File name Size
chrLength.txt 1.2K
chrNameLength.txt 3.2K
chrName.txt 2.0K
chrStart.txt 2.1K
exonGeTrInfo.tab 40M
exonInfo.tab 17M
geneInfo.tab 1.1M
Genome 3.0G
genomeParameters.txt 593
SA 12G
SAindex 1.5G
sjdbInfo.txt 9.8M
sjdbList.fromGTF.out.tab 8.7M
sjdbList.out.tab 8.7M
transcriptInfo.tab 12M

All the STAR files required by Falco are created in the specified directory hg38_sparse_index/. The user may wish to compress individual files using the gzip utility. Any such files will be gunzipped by Falco. This directory can then be copied to AWS S3 with the command:

aws s3 sync hg38_sparse_index s3://your_bucket/your_key

HISAT2 index

Aside from STAR, Falco can also utilise HISAT2 for aligning transcripts to the genome. As before, the user needs to supply the reference and index files necessary for HISAT2 to operate. The instructions for creating a HISAT2 index are available in the HISAT2 manual. A suggested command to create the necessary HISAT2 files is (this example uses a human genome):

hisat2-build GRCh38.primary_assembly.genome.fa hg38_hisat_index/hisat2.index

The hg38_hisat_index/ directory should look something like:

File name Size
hisat2.index.1.ht2 974M
hisat2.index.2.ht2 728M
hisat2.index.3.ht2 15K
hisat2.index.4.ht2 728M
hisat2.index.5.ht2 1.3G
hisat2.index.6.ht2 741M
hisat2.index.7.ht2 8
hisat2.index.8.ht2 8

All the HISAT2 files required by Falco are created in the specified directory hg38_hisat_index/, with hisat2.index as the prefix for the index file names. The user may wish to compress individual files using the gzip utility.
Any such files will be gunzipped by Falco. This directory can then be copied to AWS S3 with the command:

aws s3 sync hg38_hisat_index s3://your_bucket/your_key

Cluster Creator

Falco creates an AWS EMR (Elastic MapReduce) cluster of computing resources that are configured to run Spark and Hadoop map/reduce jobs. It is up to the user to decide what type of nodes, and how many, are required to construct the cluster; this will most likely be determined by the size of the input data.

The user should be familiar with the various AWS instance types, and the EMR pricing model.

The configuration file used to create the EMR cluster for the Falco framework is emr_cluster.config, and has the following format (an illustrative example with placeholder values follows the field descriptions).

[EMR]
release_label =
name =
log_uri =
bootstrap_scripts =
bootstrap_scripts_s3_location =
bootstrap_scripts_local_location =
upload_bootstrap_scripts =
software_installer_location =
genome_folder_location =
  • release_label - the current version of AWS EMR being used for Falco
  • name - a name for the cluster
  • log_uri - log file AWS S3 location (e.g. s3://mybucket/mylocation)
  • bootstrap_scripts - the file name(s) of the bootstrap script(s) to be run when configuring the cluster nodes
  • bootstrap_scripts_s3_location - AWS S3 location of bootstrap scripts
  • bootstrap_scripts_local_location - location of bootstrap scripts on local resource
  • upload_bootstrap_scripts - True or False; if True then automatically uploads the bootstrap scripts to the specified AWS S3 location
  • software_installer_location - AWS S3 location for software files required for installation
  • genome_folder_location - AWS S3 location for genome folder containing reference and index folders
[EMR_nodes]
key_name =
service_role =
instance_profile =
master_instance_type =
master_instance_count =
core_instance_type =
core_instance_count =
core_instance_spot =
core_instance_bid_price =
task_instance_type =
task_instance_count =
task_instance_spot =
task_instance_bid_price =
  • key_name - AWS EC2 key-pair name used to control access to an AWS EC2 instance. See key-pairs for more information.
  • service_role - AWS service role for an EC2 instance; used to control the capabilities of an instance; recommended default value is EMR_DefaultRole.
  • instance_profile - AWS profile for an EC2 instance; used to control the capabilities of an instance; recommended default value is EMR_EC2_DefaultRole.
  • master_instance_type - EC2 instance type for EMR master node
  • master_instance_count - number of master instances; this would normally be set to 1
  • core_instance_type - EC2 instance type for EMR core nodes
  • core_instance_count - number of core instances
  • core_instance_spot - True or False; if True spot instance types will be used
  • core_instance_bid_price - bid price for core instance - if spot is True
  • task_instance_type - EC2 instance type for EMR task nodes
  • task_instance_count - number of task instances
  • task_instance_spot - True or False; if True spot instance types will be used
  • task_instance_bid_price - bid price for task instance - if spot is True
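
As an illustration only, a filled-in emr_cluster.config might look something like the following. Every value shown is a placeholder (bucket names, script names, key-pair name, instance types, and the EMR release label are examples, not Falco defaults) and should be replaced with the user's own settings:

[EMR]
release_label = emr-4.7.2
name = falco-cluster
log_uri = s3://your_bucket/emr_logs
bootstrap_scripts = install_software.sh, copy_reference.sh
bootstrap_scripts_s3_location = s3://your_bucket/bootstrap
bootstrap_scripts_local_location = bootstrap
upload_bootstrap_scripts = True
software_installer_location = s3://your_bucket/software
genome_folder_location = s3://your_bucket/hg38

[EMR_nodes]
key_name = my-key-pair
service_role = EMR_DefaultRole
instance_profile = EMR_EC2_DefaultRole
master_instance_type = m3.xlarge
master_instance_count = 1
core_instance_type = r3.4xlarge
core_instance_count = 2
core_instance_spot = False
core_instance_bid_price =
task_instance_type = r3.4xlarge
task_instance_count = 0
task_instance_spot = True
task_instance_bid_price = 0.50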

Split Job

Typical input data may come from a single-cell RNA-seq run, which can produce a number of files of varying sizes. The split job splits these files into a number of chunks, with the aim of distributing the data evenly within the Falco framework. This has been shown to significantly reduce processing time.

The splitter works by first concatenating every four lines of a record into one line. Any matching paired record is also concatenated onto this line, so that two paired-end files become one file. The resulting file is then split into smaller chunks.
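
As a rough illustration of this idea only (this is not the actual Falco splitter, which runs as a distributed job on the cluster, and the file names are hypothetical), the transformation for one pair of fastq files could be sketched with standard shell tools:

# flatten each 4-line fastq record to one tab-separated line, join the two mates,
# then split the result into fixed-size chunks
paste <(paste - - - - < sample_r1.fastq) <(paste - - - - < sample_r2.fastq) > sample_interleaved.txt
split -l 100000 sample_interleaved.txt sample_chunk_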

split_job.config is the configuration file for this job, and has the following format.

[job_config]
name =
action_on_failure =
splitter_script =
splitter_script_s3_location =
splitter_script_local_location =
upload_splitter_script =
  • name - a name for this job
  • action_on_failure - what to do if the job fails (normally CONTINUE)
  • splitter_script - the pathless file name of the script that will be executed on EMR for this job
  • splitter_script_s3_location - the AWS S3 URI for the splitter script
  • splitter_script_local_location - the local path to the splitter script
  • upload_splitter_script - True or False; if True the splitter script will be copied from the local path to the S3 URI
[script_arguments]
manifest =
input_location =
output_location =
report_location =
region =
  • manifest - an AWS S3 URI for the manifest file
  • input_location - an AWS S3 URI for the location of the input data
  • output_location - an AWS S3 URI for the location of the output data
  • report_location - an AWS S3 URI for the location of the output report
  • region - the AWS S3 region

Note that each line in the manifest file is of the form fname\tfname (for paired-end data), where fname is the pathless file name of an input data file.
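
For example, a manifest covering two hypothetical paired-end samples could contain the following two lines (\t denotes a tab character, as above; the file names are placeholders):

sample1_r1.fastq\tsample1_r2.fastq
sample2_r1.fastq\tsample2_r2.fastq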

Pre-processing Job

Pre-processing is an optional step where the user may wish to use the Falco pipeline to process the data in some way before it is submitted for analysis. This pre-processing step typically involves trimming reads based on quality scores and removing adapters. Examples of open-source software used for such a pre-processing step include Trimmomatic, Trim Galore, Cutadapt, and PRINSEQ.

The pre-processing step takes place on the AWS EMR cluster as a Hadoop map/reduce job.

To execute the pre-processing step on Falco, the user must supply their own pre-processing script suitable for execution in a bash-shell Linux environment. The user script will be called from the bash wrapper script source/preprocessing/preprocess_streaming.sh.

Note that the user script is called in the following manner: ./user_script_name <arg>, where <arg> is a single string containing either a single (fastq) file name (single-end data), or two (fastq) file names separated by a single space. For example, the call from the wrapper bash script could be ./user_script "r1.fastq" or ./user_script "r1.fastq r2.fastq". Note the following important points:

  • all specified additional files will be copied to the current working directory of the user script
  • the fastq file(s) are also located in the current working directory - unzipped
  • the name(s) of the preprocessed fastq file(s) must be the same as the original fastq file(s) as supplied to the user script

A number of example pre-processing scripts are included with Falco, located in the source/preprocessing folder.
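
As an illustration only (this is not one of the bundled scripts), a minimal user script - saved as, say, trim_reads.sh - that quality-trims reads with cutadapt while preserving the original file names, as required above, might look like the following. It assumes cutadapt is available on the cluster nodes:

#!/bin/bash
# Hypothetical pre-processing sketch: quality-trim reads with cutadapt, then
# overwrite the original fastq file(s) so the names stay unchanged for Falco.
# $1 holds one fastq name (single-end) or two names separated by a space (paired-end).
set -e
read -r READ1 READ2 <<< "$1"

if [ -z "$READ2" ]; then
    # single-end: trim low-quality 3' bases (Q < 20)
    cutadapt -q 20 -o trimmed_"$READ1" "$READ1"
    mv trimmed_"$READ1" "$READ1"
else
    # paired-end: trim both mates together
    cutadapt -q 20 -o trimmed_"$READ1" -p trimmed_"$READ2" "$READ1" "$READ2"
    mv trimmed_"$READ1" "$READ1"
    mv trimmed_"$READ2" "$READ2"
fi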

The configuration file for the pre-processing job is preprocessing_job.config. This .config file has the following sections (an illustrative example with placeholder values follows the field descriptions):

[job_config]
name =
action_on_failure =
script =
script_s3_location =
script_local_location =
upload_script =
mapper_memory =
  • name - name for job; e.g. pre-processing dd/mm/yy
  • action_on_failure - action to take if job fails; normally set to CONTINUE
  • script - name of wrapper script that controls this job; the user would not normally modify this name: preprocess_streaming.sh
  • script_s3_location - an AWS S3 key that will be prepended to the script name (e.g. s3://mybucket/myscripts)
  • script_local_location - the local path to the wrapper script
  • upload_script - True or False; if True, then the script will automatically be uploaded to the specified AWS S3 location
  • mapper_memory - required by the AWS EMR map/reduce job to determine how much memory, in megabytes, is required to pre-process each file (or file pair). This will depend on the software used for pre-processing. A reasonable default is 4096; Trimmomatic requires 10240.
[script_arguments]
manifest =
input_location =
output_location =
report_location =
region =
  • manifest - AWS S3 key for the manifest file, a text file listing the file names of the "interleaved" input data
  • input_location - AWS S3 key for the location of the input data
  • output_location - AWS S3 key for the location of the output data
  • report_location - AWS S3 key for the location of the pre-processing output report
  • region - AWS S3 region in which the S3 bucket is located for all S3 keys
[user_script_config]
script =
supporting_files =
user_files_s3_location =
user_files_local_location =
upload_user_files =
  • script - the file name of the user pre-processing script
  • supporting_files - a comma separated list of supporting files required by the user script
  • user_files_s3_location - the AWS S3 key for the location of the user files
  • user_files_local_location - name of the directory in which the user files are located (e.g. source/preprocessing)
  • upload_user_files - True or False; if True, then all user files will be automatically uploaded to the specified AWS S3 location
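
Purely as an illustration, a preprocessing_job.config that uses a user script like the hypothetical trim_reads.sh sketch above could look like the following; all bucket names and keys are placeholders:

[job_config]
name = pre-processing example
action_on_failure = CONTINUE
script = preprocess_streaming.sh
script_s3_location = s3://your_bucket/scripts
script_local_location = source/preprocessing
upload_script = True
mapper_memory = 4096

[script_arguments]
manifest = s3://your_bucket/data/manifest_interleaved.txt
input_location = s3://your_bucket/data/interleaved
output_location = s3://your_bucket/data/preprocessed
report_location = s3://your_bucket/reports/preprocessing
region = us-west-2

[user_script_config]
script = trim_reads.sh
supporting_files =
user_files_s3_location = s3://your_bucket/user_scripts
user_files_local_location = source/preprocessing
upload_user_files = True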

Analysis Job

The analysis job is the primary job of Falco that produces the feature quantification results. The job utilises the Spark software on the EMR cluster.

The configuration file is analysis_job.config, whose format is described below (an illustrative example of the [script_arguments] section, with placeholder values, is shown at the end of this section).

[job_config]
name =
action_on_failure =
analysis_script =
analysis_script_s3_location =
analysis_script_local_location =
upload_analysis_script =
  • name - a name for this job
  • action_on_failure - what EMR will do if job fails; normally CONTINUE
  • analysis_script - the pathless file name of the script to run on the EMR cluster for this job
  • analysis_script_s3_location - the AWS S3 URI for the analysis script
  • analysis_script_local_location - the local path to this script
  • upload_analysis_script - True or False; if True the analysis script will be copied from the local resource to the specified AWS S3 URI
[spark_config]
driver_memory =
executor_memory =
  • driver_memory - the amount of memory for the Spark driver, which runs on the master node; a reasonable configuration for the driver would be driver_memory = 30g
  • executor_memory - the amount of memory for each Spark executor (the analysis job is executed in parallel by individual executors); when using STAR with a sparse index of 2, a reasonable configuration would be executor_memory = 30g, while for HISAT2 a reasonable configuration would be executor_memory = 11g.
[script_arguments]
input_location =
output_location =
annotation_file =
aligner_tool = 
aligner_extra_args =
counter_tool =
counter_extra_args =
run_picard =
picard_extra_args =
strand_specificity =
region = us-west-2
  • input_location - the AWS S3 URI for the location of the input data
  • output_location - the AWS S3 URI for the location of the output data
  • annotation_file - the pathless filename for the genome annotation file
  • aligner_tool - the name of the tool to be used for alignment; current options are STAR or HISAT2.
  • aligner_extra_args - any additional arguments needed by the aligner; the user should not attempt to set the --runThreadN argument for STAR, or the -p and --tmo* arguments for HISAT2 - these are controlled by Falco.
  • counter_tool - the name of the tool to be used for quantification; current options are featureCount or HTSeq.
  • counter_extra_args - any additional arguments needed by quantification tool - e.g. for featureCount, -t exon -g gene_name will result in the output containing gene names. The user can use this option to specify gene codes in the output if required. The user should not attempt to set the -T (threads) option - this is controlled by Falco.
  • run_picard - True or False; if True Picard Tools will be used for obtaining additional alignment metrics from the STAR output and the reference Flat file
  • picard_extra_args - any extra arguments required for Picard Tools
  • strand_specificity - required for Picard Tools (strand-specific library prep); valid entries are: NONE | FIRST_READ_TRANSCRIPTION_STRAND | SECOND_READ_TRANSCRIPTION_STRAND
  • region - AWS region

* The --tmo option for HISAT2 is used to ensure that the output of HISAT2 alignment is deterministic.
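
For illustration, a [script_arguments] section configured for the default STAR/featureCount analysis might look like the following; the bucket names are placeholders, and the annotation file name is the hg38 example used earlier in this document:

[script_arguments]
input_location = s3://your_bucket/data/preprocessed
output_location = s3://your_bucket/results
annotation_file = GRCh38.gencode.v24.primary_assembly.annotation.gtf.gz
aligner_tool = STAR
aligner_extra_args =
counter_tool = featureCount
counter_extra_args = -t exon -g gene_name
run_picard = True
picard_extra_args =
strand_specificity = NONE
region = us-west-2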

Running the Analysis

Analysis Output

As described in the Analysis Job section, the analysis output is stored at the AWS S3 URI configured in the output_location field of the analysis_job.config file. The output consists of two files in .csv format, which can be opened using a spreadsheet tool of the user's choice. The two files are named samples_expression.csv and samples_qc_report.csv.

samples_expression.csv contains the feature counts from the analysis. The first column of this output will be either the gene name or the gene code - depending on what options the user has configured for the counter_extra_args option in the analysis_job.config file. For example, this column will contain gene names if counter_extra_args is set to -t exon -g gene_name. The subsequent columns will contain the counts.
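
Purely as an illustration of the layout (the sample names and counts below are made up, and the exact header depends on the counter options used), samples_expression.csv might begin like:

gene_name,sample1,sample2
A1BG,0,2
A1CF,5,0
A2M,13,1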

samples_qc_report.csv is the quality report. The first column describes what each value in that row represents. The row labels are similar to the following if STAR and featureCount are used:

  • QC_STAR_Number_of_input_reads
  • QC_STAR_Number_of_reads_mapped_to_multiple_loci
  • QC_STAR_Number_of_reads_mapped_to_too_many_loci
  • QC_STAR_Number_of_splices:_AT/AC
  • QC_STAR_Number_of_splices:Annotated(sjdb)
  • QC_STAR_Number_of_splices:_GC/AG
  • QC_STAR_Number_of_splices:_GT/AG
  • QC_STAR_Number_of_splices:_Non-canonical
  • QC_STAR_Number_of_splices:_Total
  • QC_STAR_Uniquely_mapped_reads_number
  • QC_featureCount_assigned
  • QC_featureCount_unassigned_ambiguity
  • QC_featureCount_unassigned_chimera
  • QC_featureCount_unassigned_duplicate
  • QC_featureCount_unassigned_fragmentlength
  • QC_featureCount_unassigned_mappingquality
  • QC_featureCount_unassigned_multimapping
  • QC_featureCount_unassigned_nofeatures
  • QC_featureCount_unassigned_nonjunction
  • QC_featureCount_unassigned_secondary
  • QC_featureCount_unassigned_unmapped
  • QC_picard_coding_bases
  • QC_picard_correct_strand_reads
  • QC_picard_ignored_reads
  • QC_picard_incorrect_strand_reads
  • QC_picard_intergenic_bases
  • QC_picard_intronic_bases
  • QC_picard_pf_aligned_bases
  • QC_picard_pf_bases
  • QC_picard_utr_bases

The entries starting with QC_picard apply only if the Picard option was activated for the analysis run. Each column from the second column onwards contains the values for a particular sample in the analysis.
