update (#3901)
Co-authored-by: atarashansky <atarashansky@CZIMACOS3990.hsd1.ma.comcast.net>
atarashansky and atarashansky authored Jan 10, 2023
1 parent 7446608 commit 6233ef3
Showing 1 changed file with 5 additions and 4 deletions.
@@ -1,6 +1,6 @@
# Find Marker Genes

-Genes that are highly enriched in a particular group of cells can be used as "markers" for that population. Marker genes are commonly used to identify cell types in single-cell RNA-seq data. In Gene Expression, we can use the Find Marker Genes tool to identify genes that are specific to a particular cell type relative to other cell types in its tissue.
+Genes that are highly enriched in a particular group of cells can be used as markers for that population. Marker genes are commonly used to identify cell types in single-cell RNA-seq data. In Gene Expression, we can use the Find Marker Genes tool to identify genes that are specific to a particular cell type relative to other cell types in its tissue.


## Tutorial
@@ -28,7 +28,7 @@ For a selected cell type in a tissue,
- Sort the genes by their marker score in descending order and return the top 25.

## Welch's t-test
-Welch's t-test is a non-parametric test for comparing the means of two independent samples with unequal sample sizes and variances. For a particular gene,
+Welch's t-test is a statistical test for comparing the means of two independent samples with unequal sample sizes and variances. For a particular gene,
- Let two groups of cells be $c_1$ and $c_2$. Calculate the following values:
- $m_1$ and $m_2$ are the average expression of the gene in $c_1$ and $c_2$ respectively.
- $s_1$ and $s_2$ are the standard deviations of the gene in $c_1$ and $c_2$ respectively.
@@ -42,9 +42,10 @@ Welch's t-test is a non-parametric test for comparing the means of two independe
- Calculate the p-value using the t-distribution ($P(T >t)$).
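
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the function name `welch_t` is hypothetical, and the t-statistic and degrees-of-freedom formulas are the standard Welch and Welch-Satterthwaite ones, which are assumed to match the steps elided from this diff. Only the standard library is used, so the p-value step is left out.

```python
import math
from statistics import mean, variance


def welch_t(c1, c2):
    """Return (t, df) for Welch's unequal-variances t-test.

    c1 and c2 are sequences holding one gene's expression values
    in the two groups of cells (at least two values each).
    """
    n1, n2 = len(c1), len(c2)
    m1, m2 = mean(c1), mean(c2)          # average expression per group
    v1, v2 = variance(c1), variance(c2)  # sample variances s_1^2 and s_2^2
    se1, se2 = v1 / n1, v2 / n2          # squared standard errors of the means

    # Standard Welch t-statistic (assumed to match the elided step).
    t = (m1 - m2) / math.sqrt(se1 + se2)

    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df


t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(t, df)
```

The one-sided p-value $P(T > t)$ would then come from the upper tail of a t-distribution with `df` degrees of freedom, for example via a survival function such as `scipy.stats.t.sf(t, df)` if SciPy is available.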

## Caveats
-- It is important to note that some methodological decisions were made to balance accuracy with efficiency and scalability. For example, we use a t-test to perform differential expression, which is a simple and fast test. However, it may not be as accurate as more sophisticated (and computationally intensive) tests, such as the Wilcoxon rank-sum test.
+- It is important to note that some methodological decisions were made to balance accuracy with efficiency and scalability. For example, we use a t-test to perform differential expression, which is a simple and fast test. However, it may not be as accurate as more sophisticated (and computationally intensive) statistical tests.
- While differential expression is typically performed on raw counts for single-cell RNA sequencing data, we opted to use the <NextLink href="04__Analyze%20Public%20Data/4_2__Gene%20Expression%20Documentation/4_2_3__Gene%20Expression%20Data%20Processing">rankit-normalized</NextLink> values in order to identify marker genes that corroborate the gene expressions shown in the dot plot.
- For the beta, we opted to not report the p-values for each gene. Though we can calculate approximate p-values using the t-statistic and degrees of freedom calculated in *Welch's t-test*, it is difficult to accurately aggregate the p-values for each gene across all comparisons. The most reliable method involves repeatedly permuting the data to generate a null distribution, but this is computationally intensive.
- Currently, marker genes are only calculated for cell types in tissues using only _healthy_ cells. Applying secondary filters to the data (like disease, ethnicity, etc.) does not affect the results. Enabling dynamic calculation of marker genes for arbitrary populations of cells in arbitrary subsets of the data may be a direction for future development.
- Ideal marker genes for a particular cell type are expressed in all of its cells and not expressed anywhere else: they have binary expression patterns. In reality, almost no genes have truly binary expression patterns. Instead, they have genes that are statistically enriched in their cells relative to all other cell types. Additionally, genes may be good markers for a cell type in one context (e.g. a tissue) but not another (e.g. the entire human body).
-- Finally, it may be difficult to identify any good markers for some cell types, especially if they have a small number of cells or are compared to many similar cell types. To account for the former scenario, marker genes are not displayed for cell types with fewer than 25 cells. The latter scenario is particularly relevant in the blood as it contains many closely-related cell types. As a result, we have temporarily disabled the find marker genes feature for cell types in blood.
+### Marker genes are not available for blood or small populations of cells
+It may be difficult to identify any good markers for some cell types, especially if they have a small number of cells or are compared to many similar cell types. To account for the former scenario, marker genes are not displayed for cell types with fewer than 25 cells. The latter scenario is particularly relevant in the blood as it contains many closely-related cell types. As a result, we have temporarily disabled the find marker genes feature for cell types in blood.
