Skip to content

MustoeLab/DanceMapper

Repository files navigation

DANCE-MaPper

ML clustering code and associated analysis scripts for deconvoluting RNA ensembles from single-molecule DMS-MaP datasets


Copywrite 2023 Anthony Mustoe

This project is licensed under the terms of the MIT license

Developed by:

Anthony Mustoe Lab, Baylor College of Medicine

Kevin Weeks Lab, University of North Carolina

Contact: anthony.mustoe@bcm.edu

Dependencies

Installation

  • Open externalpaths.py in a text editor and insert correct paths to dependencies

  • Compile accessoryFunctions.pyx cython routines by running: python setup.py build_ext --inplace

ShapeMapper preprocessing

DanceMapper requires initial preprocessing of sequencing reads by ShapeMapper2 (v2.2 is preferred). ShapeMapper2 should be run with the --output-parsed-mutations option. ShapeMapper2 can be obtained at https://github.com/Weeks-UNC/shapemapper2

DanceMapper.py

Run DanceMapper.py --help for complete usage information

Note that we recommend having at least 250,000 mapped reads for reliable DANCE deconvolution, and ideally

500,000 reads. Deconvolution may be possible with lower read depths, but we do not presently know the lower bound.

For PAIR and RING analysis of deconvoluted reads, we recommend having at least 1,000,000 reads, and ideally

1,000,000 reads per state.

The current script is serial (single cpu). Run times vary based RNA size, number of reads, and number of final clusters. When performing primary clustering (--fit), anticipate between 1-24 hours. When running PAIR or RING analysis (--pairmap or --ring) anticipate 12-48 hours each.

Note that DanceMapper is very memory intensive. As a rough guideline, you will need 50 x N x R bytes, where N is the RNA length and R is the # of reads. So for a 400 nt long RNA with 1M reads, this would be 20 GB. We plan to release a memory calculator tool with future releases.

Input:

parsed.mut 
    file output by ShapeMapper

profile.txt 
    file output by ShapeMapper

Output:

.bm file 
    save file of the Bernoulli mixture model. Only generated when using the --fit option

-reactivities.txt file
    normalized reactivities for each structure. Only generated when using the --fit option

[i]-rings.txt file
    RINGs for state i (window=1). Only generated when using the --ring option

[i]-pairmap.txt file
    PAIRs for state i. Only generated when using the --pairmap option

[i]-pairmap.bp file
    PAIR energy restraints for state i. Only generated when using the --pairmap option

[i]-allcorrs.txt file
    RINGs (window=3) for state i. Only generated when using the --pairmap option

Note PAIR/RING calculations have modestly changed from v1.0. To run using original parameters, use the following flags: --oldDMSnorm --pm_secondary_reactivity 0.5 --mincount 50

foldClusters.py

Script for performing RNAstructure modeling based on clustered reactivities and plotting results using ArcPlot. Takes -reactivities.txt file as input. Can also accept -pairmap.bp restraints.

foldClusters.py will generate sequence ([out].seq) and normalized dms files ([out]-[i].dms) for performing RNAstructure modeling using the -dmsnt option. (Note that no math is being done, it simply disaggregated the -reactivites.txt file).

Folding is then done using Fold, and structure models will be written as [out]-[i].ct. PK folding is also available using the --pk option. Note that PK folding will use the hierarchical foldPK script distributed as part of RNAtools, which wraps around ShapeKnots allowing discovery of multiple PKs (see RNATools README for more information). Structure models are written as CT format files (see RNAstructure documentation for details). For PK folding, multiple CT files may be generated as part of the hierarchical folding process. These are denoted as [out]-[i].1.ct, .2.ct, etc. The final solution will be named [out]-[i].f.ct

Finally, ArcPlots are generated using ArcPlot and saved as [out]-[i].pdf. PDFs show the MFE structure, the DMS reactivity profile, and PAIR data (if the --bp flag is used).

Run foldClusters.py --help for additional options and usage information

Note that the pairing probability option is currently not supported in standard distributions of RNAstructure. We are working on making this option available. Please contact us for more information in the meantime.

plotClusters.py

Script for visualizing and comparing reactivities of DanceMaP identified clusters. (Makes step plots, also known as skyline plots).

Run plotClusters.py --help for usage information

Example

Some example data and commands are provided in the example directory.

Some generic example commands are below:

Preprocess data

shapemapper --target add.fa --name example --amplicon --output-parsed --dms \
--modified --R1 example-mod_R1.fastq.gz --R2 example-mod_R2.fastq.gz \
--untreated --R1 example-neg_R1.fastq.gz --R2 example-neg_R2.fastq.gz

Run DanceMapper with PAIR and RING analysis

python DanceMapper.py --mod example_Modified_add_parsed.mut --unt example_Untreated_add_parsed.mut --prof example_add_profile.txt --out example --fit --pair --ring

Fold each ensemble state (MFE) using PAIR restraints and get arcPlot visualization, including of PAIRs

python foldClusters.py --bp example example-reactivities.txt example