Skip to content

scATAC seq data processing

Hyunsoo Kim edited this page Feb 9, 2023 · 3 revisions

scATAC-seq data processing

Step 1: align sequences in scATAC-seq FASTQ files to GRCh38 reference genome by 10x Genomics cellranger-atac count to obtain two fragments.tsv.gz files for two samples.

Step 2: Make the following directoy structure with copy or link.

../count_male-bc
├── Patient1
│   ├── outs
│   │   ├── fragments.tsv.gz
│   │   └── fragments.tsv.gz.tbi
└── Patient2
    └── outs
        ├── fragments.tsv.gz
        └── fragments.tsv.gz.tbi

Step 3: Make an ArchRProject object for two samples with the following command:

./make_sc-atac-seq_archr_obj.R --n_cores 60 --dir_count ../count_male-bc --dir_output ./output_male-bc --dir_seurat_obj ./dir_seurat_obj_male-bc --keep_most_common_cell_type_for_each_cluster --max_dimstouse 30 --seurat_resolution 0.2 --min_tss 0 --min_frags 1000 --colorlim 1.5 --umap_n_neighbors 30 --umap_min_dist 0.3 --umap_metric cosine --umap_metric_doubletscores cosine --harmony_theta 0 --harmony_lambda 20 male-bc

This will take care of all samples under ../count_male-bc. The contents of the output directory of "./output_male-bc" follows:

output_male-bc
├── archr_output
│   ├── Annotations
│   ├── ArrowFiles
│   ├── Embeddings
│   ├── LSI_ATAC
│   ├── Peak2GeneLinks
│   ├── PeakCalls
│   │   ├── InsertionBeds
│   │   └── ReplicateCalls
│   ├── Plots
│   └── RNAIntegration
│       └── GeneIntegrationMatrix_ArchR
├── log
├── pdf
├── png
│   ├── male-bc_Patient1_qc.png
│   ├── male-bc_Patient2_qc.png
│   └── ...
├── qc
│   ├── Patient1
│   │   ├── Patient1-Doublet-Summary.pdf
│   │   ├── Patient1-Fragment_Size_Distribution.pdf
│   │   └── Patient1-TSS_by_Unique_Frags.pdf
│   └── Patient2
│       ├── Patient2-Doublet-Summary.pdf
│       ├── Patient2-Fragment_Size_Distribution.pdf
│       └── Patient2-TSS_by_Unique_Frags.pdf
├── rds
│   └── male-bc_archrproj_obj_final.rds
├── tmp
└── xlsx
    └── male-bc_sc-atac-seq_pipeline_summary.xlsx

Let's review the content of QC files (male-bc_Patient1_qc.png and male-bc_Patient2_qc.png).

Step 4: Add peak-to-gene links to the final ArchRProject object and perform motif enrichment analysis by the following command:

./analyze_cancer_specific_p2g.R --n_cores 4 --dir_output ./output_p2g_male-bc --dir_seurat_obj ./dir_seurat_obj_male-bc --dir_archr_output ./output_male-bc/archr_output --subset_archrproject_force_to_update --max_dimstouse 30 --harmony_theta 0 --seed_kmeans 4 --exclude_cluster_epi_unassigned --exclude_cluster_highly_overlapped_with_normal_peaks --exclude_cluster_low_nfrags male-bc

This will subset the final ArchRProject in order to remove clustes of unassigned epithelail cells by InferCNV and low mean of nFrags. The peak-to-gene links were obtained by addPeak2GeneLinks() (see document ArchR book/peak2genelinkage-with-archr). The contents of the output directory of "./output_p2g_male-bc" follows:

output_p2g_male-bc
├── archr_output
│   ├── Annotations
│   ├── ArrowFiles
│   ├── Embeddings
│   ├── LSI_ATAC
│   ├── Peak2GeneLinks
│   │   ├── seATAC-Group-KNN.rds
│   │   └── seRNA-Group-KNN.rds
│   ├── PeakCalls
│   │   ├── ...
│   │   ├── InsertionBeds
│   │   ├── ReplicateCalls
│   │   ├── X0.Epi..Tumor-reproduciblePeaks.gr.rds
│   │   ├── X2.Epi..Tumor-reproduciblePeaks.gr.rds
│   │   └── ...
│   ├── Plots
│   ├── RNAIntegration
│   └── Save-ArchR-Project.rds
├── log
├── pdf
│   ├── scatterplot_cisbp_motif_up.normal-vs-cancer_cancer-specific_enhancer.pdf
│   ├── venn_chippeakanno_overlaps_of_enhancer_peaks.pdf
│   └── ...
├── png
├── rds
│   ├── all_p2g_observed.rds
│   ├── cancer_enriched_enhancer_p2g_table.rds
│   ├── cancer_specific_enhancer_p2g_table.rds
│   ├── cancer_specific_p2g_table_degs.rds
│   ├── find_overlaps_of_enhancer_peaks_output_overlappingpeaks_obj.rds
│   ├── male-bc_archrproj_obj_p2gs.rds
│   ├── markerpeaks.for_comparison.normal-vs-cancer_cancer-specific_enhancer.rds
│   ├── p2g.df.sub.plot_enhancer.rds
│   ├── proj.archr.for_comparison.normal-vs-cancer_cancer-specific_enhancer.rds
│   ├── proj.archr.for_comparison.normal-vs-cancer_non-cancer-specific_enhancer.rds
│   ├── venn_chippeakanno_overlaps_of_enhancer_peaks.rds
│   ├── ...
├── tmp
├── tsv
│   ├── cancer_specific_enhancer_p2g_table.tsv
│   ├── peakannoenrichment_cisbp_motif_up.normal-vs-cancer_non-cancer-specific_enhancer.tsv
│   └── ...
└── xlsx
    └── male-bc_sc-atac-seq_cancer_specific_p2g_summary.xlsx

The final ArchProject object (male-bc_archrproj_obj_p2gs.rds) is located under output_p2g_male-bc/rds. The peak2gene data.frame for all enhancers related with epithelial cells was stored at cancer_enriched_enhancer_p2g_table.rds. The peak2gene data.frame for cancer specific enhancers was stored at cancer_specific_enhancer_p2g_table.rds. The cancer specific enhancers were defined by subtracting cancer enriched enhancers by peak2gene links overlapped with enhancer peaks in normal epithelial cells (i.e. human mamary epithelial cells (HMEC) H3K27ac peaks in our case).

Clone this wiki locally