Skip to content

Commit

Permalink
Merge pull request #228 from AlexsLemonade/allyhawkins/rms-singler-an…
Browse files Browse the repository at this point in the history
…alysis

Compare cell type annotations for RMS samples to SingleR annotations
  • Loading branch information
allyhawkins authored Jul 14, 2023
2 parents 8152a52 + 30c2336 commit d81a1c2
Show file tree
Hide file tree
Showing 6 changed files with 3,926 additions and 3 deletions.
42 changes: 42 additions & 0 deletions celltype_annotation/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
# Cell type annotation

This folder holds analysis and scripts developed for exploring cell type annotation methods.

## Analysis

Inside the analysis folder are a set of template notebooks that have been used for exploring cell type annotations using different methods.

1. `SingleR-non-immune-ref-comparison.Rmd`: This notebook was used to compare cell type annotations obtained from `SingleR`.
We compared the annotations when using references with or without immune cell populations.
In particular, this notebook is specifically useful for looking at cell type annotation in blood tumors.

2. `SingleR-combining-refs.Rmd`: This notebook specifically compares using multiple `celldex` references to using a single `celldex` reference to annotate libraries with `SingleR`.
Again, this notebook specifically is for blood tumors.

3. `marker-gene-celltype-exploration.Rmd`: This notebook looks at using other marker gene based methods, specifically `scType` and `CellAssign`.

4. `cell-assign-sarcoma.Rmd`: Here we specifically look at using `CellAssign` with Rhabdomyosarcoma samples from `SCPCP000005`.
This analysis compares the assignments using `CellAssign` to the submitter annotations.

5. `cell-assign-sarcoma-marker-genes.Rmd`: This notebook also compare annotations using `CellAssign` to manual annotations cell type annotations in Rhabdomyosarcoma samples from `SCPCP000005`.
However instead of using public references for marker gene sets, we use `scran::FindMarkers()` to build a custom marker gene table for the specified library.

6. `singler-rms-comparison.Rmd`: This notebook compares annotations from `SingleR` to manual annotations in Rhabdomyosarcoma samples from `SCPCP000005`.

## Scripts

This folder contains any scripts used for cell type annotation.

1. `cell-assign.py`: This script can be used to perform cell type annotation using `CellAssign`.
The output will be a table of predictions.
2. `remove-cell-refs.py`: This script is used to remove a set of cell types from a `celldex` reference.
This requires a txt file with a list of cell types to remove.
The default is the `immune_celltypes.txt` file in the root directory of this repo.
3. `render-SingleR-reports.R`: This script is used to render the `SingleR-non-immune-ref-comparison.Rmd` report across an entire project.

## Utils

This folder holds helper functions that were used across multiple analysis templates.

1. `cellassign-helper-functions.R`: This script contains any helper functions used for analyzing cell type annotations obtained from `CellAssign`.
2. `singler-helper-functions.R`: his script contains any helper functions used for analyzing cell type annotations obtained from `SingleR`.
244 changes: 244 additions & 0 deletions celltype_annotation/analysis/singler-rms-comparison.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,244 @@
---
title: "SingleR: Rhabdomyosarcoma cell type comparison for `r {params$library_id}`"
output:
html_notebook:
toc: true
toc_float: true
params:
s3_singler_data_dir: "s3://nextflow-ccdl-results/scpca/processed/results/SCPCP000005"
s3_submitter_data_dir: "s3://sc-data-integration/scpca/celltype_sce"
local_data_dir: "../data"
library_id: "SCPCL000488"
sample_id: "SCPCS000262"
---

The goal of this notebook is to compare using `SingleR` for cell type annotation to manual annotations in Rhabdomyosarcoma samples.
For each SCE object, we annotated cell types using the `SingleR` workflow in `scpca-nf v0.5.2`.
This adds a cell type annotation for all available `celldex` references (`BlueprintEncodeData`, `ImmuneCellExpressionData`, `HumanPrimaryCellAtlasData`, and `MonacoImmuneData`).
Additionally, we created references from `BlueprintEncodeData` and `HumanPrimaryCellAtlasData` that do not contain immune cells.
For each reference, cell types were obtained using all available reference labels, `label.main`, `label.fine`, and `label.ont`.

We are mostly concerned with looking at the results from using `BlueprintEncodeData` and `HumanPrimaryCellAtlasData` as those references contain more than just immune cells and will be more relevant for the RMS samples.

In this notebook we do the following:

- Look at the distribution of cell type annotations with `SingleR` vs. manual
- Directly compare how each cell is annotated between `SingleR` and manual
- Calculate the median delta score across all `SingleR` annotations

## Set Up

```{r}
suppressPackageStartupMessages({
library(SingleCellExperiment)
library(ggplot2)
})
theme_set(theme_bw())
# source in helper functions for plotting
source(file.path("..", "utils", "cellassign-helper-functions.R"))
source(file.path("..", "utils", "singler-helper-functions.R"))
```

```{r}
# make local data directory if it doesn't exist
if(!dir.exists(params$local_data_dir)){
dir.create(params$local_data_dir, recursive = TRUE)
}
# build path to singler and submitter annotated sce file
singler_annotated_sce_file <- glue::glue("{params$library_id}_annotated.rds")
submitter_annotated_sce_file <- glue::glue("{params$library_id}_processed_celltype.rds")
local_singler_sce_path <- file.path(params$local_data_dir,
params$sample_id,
singler_annotated_sce_file)
local_submitter_sce_path <- file.path(params$local_data_dir,
params$sample_id,
submitter_annotated_sce_file)
# if missing any of the annotated sce files, grab them from S3
if(!(file.exists(local_singler_sce_path))){
aws_includes <- glue::glue('--include "{singler_annotated_sce_file}"')
sync_call <- glue::glue('op run -- aws s3 sync {params$s3_singler_data_dir}/{params$sample_id} {params$local_data_dir}/{params$sample_id} --exclude "*" {aws_includes}')
system(sync_call, ignore.stdout = TRUE)
}
if(!(file.exists(local_submitter_sce_path))){
sync_call <- glue::glue('op run -- aws s3 sync {params$s3_submitter_data_dir} {params$local_data_dir}/{params$sample_id} --exclude "*" --include "{submitter_annotated_sce_file}"')
system(sync_call, ignore.stdout = TRUE)
}
# read in all annotated sce
singler_annotated_sce <- readr::read_rds(local_singler_sce_path)
submitter_annotated_sce <- readr::read_rds(local_submitter_sce_path)
```


```{r}
# add submitter annotations to singler annotated object
singler_annotated_sce$submitter_annotations <- submitter_annotated_sce$celltype
# remove extra objects
rm(submitter_annotated_sce)
```

## Define Functions

```{r}
# This function is used to create a table with cell type assignments for SingleR
# This also assumes there is a `broad_celltype` column that comes from a submitter
# Input is a dataframe from the column data containing all the cell type annotations
# and the label type to grab (e.g., label.main, label.fine, or label.ont)
# output is a dataframe of just the label types indicated across all references used
# annotations in the broad_celltype column are also included in the final output
# we also identify the top 10 represented cell types
celltype_assignments_table <- function(coldata_df,
label_type,
num_celltypes = 10){
celltype_assignments <- coldata_df |>
tidyr::pivot_longer(cols = c(broad_celltype, starts_with(label_type)),
names_to = "reference",
values_to = "celltype") |>
# do some string replacements to deal with a few similar assignments across references
dplyr::mutate(reference = stringr::str_replace(reference, paste0(label_type, "_"), ""),
celltype = stringr::str_replace(celltype, "Monocytes", "Monocyte"),
celltype = stringr::str_replace(celltype, "Endothelial_cells", "Endothelial cells"),
celltype_top = forcats::fct_lump_n(celltype, num_celltypes))
# level for plotting later
references <- unique(celltype_assignments$reference)
celltype_assignments <- celltype_assignments |>
dplyr::mutate(reference_factor = forcats::fct_relevel(reference, "broad_celltype"))
return(celltype_assignments)
}
```

## Overall celltype assignments

This section examines the overall distribution of cell type assignments across all references with `SingleR` and manual annotations.
We look at this using a stacked bar chart.

The original submitter cell types contained sub groups of tumor cells, e.g., `Tumor_myoblast-A` and `Tumor_myoblast-B`.
Before comparing annotations, the subgroups are combined into a single cell type, e.g., `Tumor_myoblast`.
These will be stored in the `broad_celltype` column and used throughout the remainder of the notebook.

```{r}
# extract coldata for easy plotting
coldata_df <- colData(singler_annotated_sce) |>
as.data.frame() |>
# classify cells as tumor or normal
dplyr::mutate(cell_category = dplyr::if_else(stringr::str_detect(submitter_annotations, "Tumor"), "Tumor", submitter_annotations),
# remove sub states of myoblast/mesoderm/myocyte
broad_celltype = stringr::str_remove(submitter_annotations, "-\\w$"))
```

For all the `label.ont` reference labels, we will grab the human-readable cell ontology labels.
These labels will be used for plotting for the remainder of the notebook.

```{r}
# grab cell label names corresponding to cell ontology ids
cl_ont <- ontoProc::getCellOnto()
# rename all label.ont columns with full cell ontology label
ont_columns_idx <- stringr::str_starts(colnames(coldata_df), "label.ont")
cl_columns <- coldata_df[,ont_columns_idx] |>
purrr::map(
\(cl_id)
cl_ont$name[cl_id]
) |>
dplyr::bind_cols()
coldata_df[,ont_columns_idx] <- cl_columns
```

```{r}
# create a table of cell type assignments for each label type
labels <- c("label.main", "label.fine", "label.ont")
celltype_assignment_list <- purrr::map(labels,
\(label_type) celltype_assignments_table(coldata_df,
label_type)) |>
purrr::set_names(labels)
```


```{r, fig.width=10, fig.height=7}
# print out barcharts
purrr::imap(celltype_assignment_list,
\(celltype_assignments, label_type) celltype_barchart(celltype_assignments,
label_type))
```


## Comparison of SingleR annotations to submitter annotations

This section will directly compare cell type annotations using `HumanPrimaryCellAtlasData` or `BlueprintEncodeData` with `SingleR` vs. manual annotations.
Each chunk displays a heatmap directly comparing a `SingleR` reference (columns) to manual annotations (rows).

```{r, fig.height=5}
compare_refs_heatmap(original_assignment = coldata_df$broad_celltype,
cell_assign = coldata_df$label.ont_BlueprintEncodeData,
title = "BlueprintEncodeData - label.ont")
```

```{r}
compare_refs_heatmap(original_assignment = coldata_df$broad_celltype,
cell_assign = coldata_df$label.main_BlueprintEncodeData,
title = "BlueprintEncodeData - label.main")
```
```{r, fig.height=7, fig.width=7}
compare_refs_heatmap(original_assignment = coldata_df$broad_celltype,
cell_assign = coldata_df$label.ont_HumanPrimaryCellAtlasData,
title = "HumanPrimaryCellAtlasData - label.ont")
```

```{r, fig.height=5, fig.width=7}
compare_refs_heatmap(original_assignment = coldata_df$broad_celltype,
cell_assign = coldata_df$label.main_HumanPrimaryCellAtlasData,
"HumanPrimaryCellAtlasData - label.ont")
```

## Median delta score

The last thing we will look at is the distribution of the delta between the top and median scores (referred to as the delta median here).
We can grab the complete `SingleR` results from the `metadata` of the annotated SCE object.
These results contain a matrix of all the scores for each cell and for each label.
We expect that a reference that is more appropriate will have higher delta medians.

```{r warning=FALSE, message=FALSE, fig.height=7}
# first just grab the singleR objects so we can directly use those as input to the plotScoreHeatmap function
singler_results <- metadata(singler_annotated_sce)$singler_results
median_delta_results <- singler_results |>
purrr::map(get_median_delta) |>
dplyr::bind_rows(.id = "reference") |>
tidyr::separate(reference, into = c("label_type", "reference"), sep = "_", extra = "merge") |>
# classify each label as pruned or not
dplyr::mutate(pruned = ifelse(is.na(pruned.labels), "Yes", "No"))
# plot the distribution of median_delta for each reference, faceting by label type (main, fine, ont)
plot_metric_distribution(median_delta_results, "median_delta") +
theme(legend.position = "bottom")
```

## Conclusions

- `BlueprintEncodeData` seems like the most appropriate reference when looking at the direct comparison of cell type labels with `SingleR` to manual.
This is mostly based on the presence of skeletal muscle cells obtained from `SingleR` annotations, whereas muscle cells don't appear to be present in `HumanPrimaryCellAtlasData`.
- There appears to be a higher proportion of `NA` cells that correspond to tumor cells than we saw with the AML samples.
- Generally I think we could use `SingleR` with solid tumors, but I think we will still want to be careful about which reference we use for each disease type.
I don't think we are going to find a single reference that works across all of our samples, although `BlueprintEncodeData` does seem to generally be doing a decent job here.
Although I'm making that conclusion mostly based on _vibes_...

## Session Info

```{r}
sessionInfo()
```
3,428 changes: 3,428 additions & 0 deletions celltype_annotation/analysis/singler-rms-comparison.nb.html

Large diffs are not rendered by default.

12 changes: 9 additions & 3 deletions celltype_annotation/utils/cellassign-helper-functions.R
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,12 @@ get_celltype_assignments <- function(predictions){

# function for creating comparison heatmaps between two different annotations
compare_refs_heatmap <- function(original_assignment,
cell_assign){
cell_assign,
title = NULL){
if(is.null(title)){
title <- ""
}

# build a matrix with reference as the rows and the celltype as the columns
label_mtx <- table(original_assignment,
cell_assign,
Expand All @@ -27,6 +32,7 @@ compare_refs_heatmap <- function(original_assignment,
pheatmap::pheatmap(label_mtx,
cluster_rows = TRUE,
width = 10,
fontsize_col = 8)
fontsize_col = 8,
main = title)

}
}
74 changes: 74 additions & 0 deletions celltype_annotation/utils/singler-helper-functions.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,74 @@
# This function is used to create a stacked bar chart for all SingleR references of a given label type
# label type is either label.fine, label.main, or label.ont

celltype_barchart <- function(celltype_assignments,
label_type){

top_celltypes <- celltype_assignments |>
dplyr::count(reference_factor, celltype_top) |>
tidyr::drop_na()

ggplot(top_celltypes, aes(x = reference_factor, fill = celltype_top, y = n, label = n)) +
geom_bar(position = "stack", stat = "identity") +
labs(
x = "Reference dataset",
y = "Number of cells",
fill = "Cell type",
title = label_type
) +
geom_text(size = 3, position = position_stack(vjust = 0.5)) +
guides(x = guide_axis(angle = 90))

}

# creates a plot showing the distribution of the desired metric with reference on the x-axis
# color by whether or not labels are pruned or not
# facet by label type (main, fine, or ont)
# we will use this function for looking at delta, median delta, and score distributions
plot_metric_distribution <- function(all_results,
metric_type){

metric_plot <- ggplot(all_results, aes_string(x = "reference", y = metric_type, group = "reference", color = "pruned")) +
ggforce::geom_sina(size = 0.1, alpha = 0.2) +
stat_summary(
aes(group = reference),
color = "black",
# median and quartiles for point range
fun = "median",
fun.min = function(x) {
quantile(x, 0.25)
},
fun.max = function(x) {
quantile(x, 0.75)
},
geom = "pointrange",
position = position_dodge(width = 0.9),
size = 0.1
) +
facet_wrap(vars(label_type))+
guides(x = guide_axis(angle = 90),
colour = guide_legend(override.aes = list(size = 2, alpha = 1)))

return(metric_plot)

}


# function to calculate the median delta for each singleR result
get_median_delta <- function(singler_result){

# grab all the scores
scores <- singler_result$scores |>
as.matrix()

# calculate the median delta (max score - median of all scores)
median_delta <- rowMaxs(scores) - rowMedians(scores)

# add median into the singler result and return as data frame
median_delta_result <- singler_result |>
as.data.frame() |>
dplyr::select(labels, delta.next, pruned.labels) |>
dplyr::mutate(median_delta = median_delta)

return(median_delta_result)
}
Loading

0 comments on commit d81a1c2

Please sign in to comment.