Added additional documentation for feature_extraction and updated readme

burtonrj · Nov 9, 2020 · c02556a
1 parent 3726fde
commit c02556a
Showing 32 changed files with 342 additions and 9 deletions.
diff --git a/.Rhistory b/.Rhistory
diff --git a/CytoPy/flow/feature_extraction.py b/CytoPy/flow/feature_extraction.py
@@ -183,6 +183,33 @@ def cluster_statistics(experiment: Experiment,
                        meta_label: str or None = None,
                        tag: str or None = None,
                        include_subject_id: bool = True):
+    """
+    Given an Experiment and the name of a Population known
+    to contain clusters from some high-dimensional clustering
+    algorithm, this function generates a dataframe of
+    statistics. Details include the number of events
+    within the cluster and what proportion of the total events
+    in the Population this number represents.
+
+    Parameters
+    ----------
+    experiment: Experiment
+    population: str (optional)
+        If no population is provided, will search all
+        possible populations for clusters
+    meta_label: str (optional)
+        If given, will filter results to include only
+        those clusters with this meta ID
+    tag: str (optional)
+        If given, will filter results to include only
+        those clusters with this tag
+    include_subject_id: bool (default=True)
+        If True, includes a column for the subject ID in
+        the resulting dataframe
+
+    Returns
+    -------
+    Pandas.DataFrame
+    """
     all_cluster_data = list()
     for sample_id in experiment.list_samples():
         fg = experiment.get_sample(sample_id)
@@ -211,6 +238,24 @@ def sort_variance(summary: pd.DataFrame,
                   identifier_columns: list,
                   value_name: str = "summary_stat",
                   var_name: str = "population"):
+    """
+    Given a dataframe generated by one of the many
+    functions in this module, sort that dataframe
+    by variance.
+
+    Parameters
+    ----------
+    summary: Pandas.DataFrame
+        Dataframe of summary statistics
+    identifier_columns: list
+        Columns to use as identifier(s) e.g. sample_id
+    value_name: str (default="summary_stat")
+    var_name: str (default="population")
+
+    Returns
+    -------
+    Pandas.DataFrame
+    """
     x = summary.melt(id_vars=identifier_columns,
                      value_name=value_name,
                      var_name=var_name)
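
The hunk above shows only the first statement of `sort_variance`, so the rest of its body is not visible here. As a rough sketch of what "sort by variance" could look like after the `melt` call, the hypothetical helper below (`sort_by_variance`, not part of CytoPy) computes the variance of each melted variable and sorts in descending order. The grouping and sort direction are assumptions, not the author's confirmed implementation.

```python
import pandas as pd


def sort_by_variance(summary: pd.DataFrame,
                     identifier_columns: list,
                     value_name: str = "summary_stat",
                     var_name: str = "population") -> pd.DataFrame:
    """Hypothetical stand-in for sort_variance; the real body may differ."""
    # Wide -> long: one row per (identifier, variable) pair
    long = summary.melt(id_vars=identifier_columns,
                        value_name=value_name,
                        var_name=var_name)
    # Sample variance of each variable across all identifiers, largest first
    return (long.groupby(var_name)[value_name]
                .var()
                .reset_index()
                .sort_values(value_name, ascending=False)
                .reset_index(drop=True))


# Toy summary table: proportions of two populations in three samples
summary = pd.DataFrame({
    "sample_id": ["s1", "s2", "s3"],
    "T cells": [0.30, 0.50, 0.10],
    "B cells": [0.20, 0.21, 0.19],
})
ranked = sort_by_variance(summary, identifier_columns=["sample_id"])
print(ranked)
```

In this toy table "T cells" varies far more between samples than "B cells", so it is ranked first.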

diff --git a/README.md b/README.md
@@ -15,12 +15,12 @@ cytometry data and a clinical/experimental endpoint, we wish to find what proper
 are important for identifying a disease? What phenotypes are changing in response to a stimulus? etc). 
 The pipeline itself is centered around a MongoDB database, is built in  the Python programming language, 
 and designed with a 'low code' API, greatly 
-simplifying cytometry analysis. We can break it all down into the following steps that can be completed within minimal 
+simplifying cytometry analysis. We can break it down into the following steps that can be completed with minimal 
 code required:
 
 1. Data uploading
 2. Pre-processing
-3. Batch-effect analysis
+3. Quantifying inter-sample variation and choosing training data
 4. Supervised cell classification 
 5. High-dimensional clustering
 6. Feature extraction, selection, and description
@@ -31,6 +31,8 @@ to train a classifier. Alternatively high-dimensional clustering (by PhenoGraph
 cells in a completely unbiased fashion. CytoPy provides access to both methodologies as we observe 
 that both have benefits and failings.
 
+CytoPy is algorithm agnostic and provides a general interface for accessing the tools provided by Scikit-Learn whilst following the terminology and signatures common to this library. If you would like to expand CytoPy and add additional methods for autonomous gating, supervised classification or high-dimensional clustering, please contact me at burtonrj@cardiff.ac.uk, raise an issue or make a pull request.
+
 For more details we refer you to our pre-print <a href='https://www.biorxiv.org/content/10.1101/2020.04.08.031898v2'>manuscript</a> and software documentation. Our documentation contains 
 detailed tutorials for each of the above steps (https://cytopy.readthedocs.io/)
 
@@ -45,9 +47,7 @@ For installing MongoDB the reader should refer to https://docs.mongodb.com/manua
 CytoPy assumes that the installation is local but if a remote MongoDB database is used then a host address, port and 
 authentication parameters can be provided when connecting to the database, which is handled by cytopy.data.mongo_setup.
 
-For installing Python 3 we recommend the distribution provided on <a href='https://www.python.org/downloads/'>Python.org</a> but 
- alternatively <a href='https://www.anaconda.com/'>Anaconda</a> can be used. We suggest that CytoPy be installed within an isolated 
-programming environment and suggest the environment manager <a href='https://docs.python.org/3/tutorial/venv.html'>venv.</a>
+For installing Python 3 we recommend <a href='https://www.anaconda.com/'>Anaconda</a>, which also provides a convenient <a href='https://docs.python.org/3/tutorial/venv.html'>environment manager</a>. We suggest that you always keep CytoPy contained within its own programming environment.
 
 ### Installing CytoPy
 
@@ -69,6 +69,8 @@ CytoPy is licensed under the MIT license from the Open Source Initiative. CytoPy
 
 In future releases we are currently interested in the following:
 
+* A graphical user interface to open CytoPy up to a wider audience
+
 * Incorporating data transform/normalisation procedures that mitigate or 'remove' noise as a result of batch effect. 
 Methods of interest include <a href='https://arxiv.org/pdf/1610.04181.pdf'>MMD-ResNet</a>, 
 <a href='https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1764-6'>BERMUDA</a>,

diff --git a/docs/source/7_features.rst b/docs/source/7_features.rst
@@ -0,0 +1,81 @@
+*************************************************
+Feature extraction, selection, and summarisation
+*************************************************
+
+Once the biological samples of an experiment have been classified into phenotypically similar populations and/or clusters, we want to summarise these 'features' of the biological samples so that we can observe differences between clinical/experimental groups. CytoPy offers the feature extraction module for summarising the findings of an experiment and performing feature selection.
+
+Summarising the proportion of cell populations/clusters
+########################################################
+
+There are multiple functions in this module for extracting and summarising the findings of an **Experiment**. The first is the *experiment_statistics* function. This takes an **Experiment** and returns a Pandas DataFrame of population statistics for each sample in the experiment::
+
+	from CytoPy.flow.feature_extraction import experiment_statistics
+	from CytoPy.data.project import Project
+	pd_project = Project.objects(project_id='Peritonitis').get()
+	exp = pd_project.load_experiment('PD_N_PDMCs')
+
+	exp_stats = experiment_statistics(experiment=exp, include_subject_id=True)
+
+The resulting dataframe will have a column for the subject ID, the sample ID (FileGroup ID), the population ID, and then statistics on the number of cells in that population and the proportion of cells compared to both the immediate parent and the root population. 
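
To make the two proportions concrete, here is a small pandas sketch on made-up counts. The column names (`parent`, `n`, `prop_of_parent`, `prop_of_root`) and the toy gating tree are assumptions for illustration only; they are not necessarily the names that *experiment_statistics* returns.

```python
import pandas as pd

# Hypothetical event counts for one sample; "parent" names the
# immediate parent population in the gating tree.
counts = pd.DataFrame({
    "sample_id": ["s1", "s1", "s1"],
    "population": ["root", "T cells", "CD4 T cells"],
    "parent": [None, "root", "T cells"],
    "n": [10000, 4000, 1000],
})

# Look-up table: population name -> number of events
n_by_pop = counts.set_index("population")["n"]

# Proportion relative to the root population (all events)
counts["prop_of_root"] = counts["n"] / n_by_pop["root"]

# Proportion relative to the immediate parent; root has no parent, so NaN
counts["prop_of_parent"] = counts["n"] / counts["parent"].map(n_by_pop)

print(counts)
```

Here "CD4 T cells" make up 25% of their parent ("T cells") but only 10% of the root population.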
+
+If we then want to label this dataframe with some meta-data associated with our subjects, e.g. disease status, we can use the *meta_labelling* function. We provide it with the dataframe we have just created and the name of some variable stored in our Subject documents, and it creates a new column for this variable in the dataframe::
+
+	from CytoPy.flow.feature_extraction import meta_labelling
+	exp_stats = meta_labelling(experiment=exp, 
+				    dataframe=exp_stats, 
+				    meta_label="peritonitis")
+	
+The dataframe will now have a column named "peritonitis" containing a boolean value as to whether the patient had peritonitis or not.
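
Conceptually, this labelling step is a look-up from subject ID to a per-subject value. The sketch below imitates it with plain pandas. The dictionary `subject_meta` is a hypothetical stand-in for values that CytoPy would read from the Subject documents in the database, and the column names are assumed.

```python
import pandas as pd

# Toy statistics table with one row per (subject, population)
exp_stats = pd.DataFrame({
    "subject_id": ["p1", "p1", "p2"],
    "population": ["T cells", "B cells", "T cells"],
    "n": [4000, 1200, 3500],
})

# Hypothetical meta-data; in CytoPy this would come from Subject documents
subject_meta = {"p1": True, "p2": False}

# New column holding the meta variable for each row's subject
exp_stats["peritonitis"] = exp_stats["subject_id"].map(subject_meta)

print(exp_stats)
```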
+
+We can generate a similar dataframe but instead look at the clustering analysis performed on a particular population. This is achieved with the *cluster_statistics* function::
+
+	from CytoPy.flow.feature_extraction import cluster_statistics
+	cluster_stats = cluster_statistics(experiment=exp,
+					    population="T cells")
+
+This generates a dataframe similar to the previous one, but now each row is a cluster and additional columns are included such as population ID, cluster ID, meta label, and clustering tag. The meta label and tag can be specified as arguments to this function to filter the clusters you want. Also, *population* is optional; if it is not provided, all populations from all FileGroups are searched for existing clusters.
+
+
+Dimensionality reduction
+##########################
+
+A rapid method for detecting whether there is a 'global' difference between two experimental or clinical groups is to use dimensionality reduction and plot the data points coloured according to their group. In the example below we differentiate patients with and without acute peritonitis. The dataframe 'summary' contains the proportion of cell populations identified by XGBoost and the proportion of clusters from all our experiments combined. We can use any method from CytoPy.flow.dim_reduction for the dimensionality reduction, and a scatter plot is returned with data points coloured according to some label (here it is whether a patient has peritonitis or not)::
+	
+	from CytoPy.flow.feature_extraction import dim_reduction
+	dim_reduction(summary=summary, label='peritonitis', scale=True, method='PCA')
+	
+.. image:: images/features/pca.png
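
For readers who want to see what the PCA option does in principle, the following NumPy-only sketch scales a synthetic feature matrix and projects it onto its first two principal components. It is not CytoPy's implementation: the data, the group shift, and all variable names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic summary matrix: 20 samples x 5 features
X = rng.normal(size=(20, 5))
labels = np.array([True] * 10 + [False] * 10)
X[labels, 0] += 3.0  # shift one feature in the "positive" group

# Scale each feature to zero mean and unit variance (cf. scale=True)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via singular value decomposition of the scaled matrix
U, S, Vt = np.linalg.svd(X_scaled, full_matrices=False)
embedding = X_scaled @ Vt[:2].T          # coordinates on PC1 and PC2
explained = (S ** 2) / (S ** 2).sum()    # fraction of variance per component

print(embedding.shape, explained[:2])
```

Plotting `embedding[:, 0]` against `embedding[:, 1]` and colouring each point by `labels` gives the kind of scatter plot shown above.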
+
+Feature selection
+###################
+
+A simple approach for eliminating redundant variables is ranking them by their variance. The summary dataframe produced by the functions previously discussed can be passed to *sort_variance*, which returns a dataframe sorted by variance.
+
+This often isn't enough, however. If the number of features is large and we want to narrow down which are of most value for predicting some clinical or experimental endpoint, we can use L1 regularisation in a suitable linear model to do so. L1 regularisation, also known as 'lasso' regularisation, shrinks the coefficients of less important variables to zero, producing a sparser model. By varying the regularisation term and observing the coefficients of all our features, we can see which features shrink more rapidly than others. This serves as a helpful feature selection technique, giving us the variables important for predicting some clinical or experimental endpoint.
+
+The feature extraction module contains a function for this called *l1_feature_selection*. This function takes the feature space: a dataframe of features in which each row is a different biological sample and a 'label' column holds the label to predict. We specify which features to include in our selection and the name of the label column. The function also takes a search space as a tuple, which is passed to Numpy.logspace to generate the range of values used as the different L1 regularisation terms. The first value is the starting exponent, the second is the end exponent, and the third is *n*, the number of values generated between the start and end on a log scale. For example, (-2, 0, 50) gives 50 values from 0.01 to 1.
+
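+Finally we also provide the model to use. This must be a Scikit-Learn linear classifier that supports L1 regularisation and accepts the regularisation parameter as the argument 'C' (in Scikit-Learn, 'C' is the inverse of the regularisation strength, so smaller values regularise more strongly). If None is given then a linear support vector machine is used as default::
+
+	from CytoPy.flow.feature_extraction import l1_feature_selection
+	l1_feature_selection(feature_space=summary,
+		             features=features,
+		             label='peritonitis',
+		             scale=True,
+		             search_space=(-2, 0, 50),
+		             model=None)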
+
+.. image:: images/features/l1.png
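
The idea behind this plot can be reproduced outside CytoPy with a few lines of NumPy and Scikit-Learn. The sketch below is an assumption-laden illustration, not the code of *l1_feature_selection*: the data are synthetic, only the first two features carry signal, and a shorter search space of 20 values is used to keep it quick.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(42)

# Synthetic data: 200 samples, 6 features, label depends on features 0 and 1
X = rng.normal(size=(200, 6))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Search space tuple (start exponent, end exponent, number of values)
search_space = (-2, 0, 20)
c_values = np.logspace(*search_space)  # 10**-2 ... 10**0

coefs = []
for c in c_values:
    # L1-penalised linear SVM; smaller C means stronger regularisation
    model = LinearSVC(penalty="l1", dual=False, C=c, max_iter=5000)
    model.fit(X, y)
    coefs.append(model.coef_.ravel())
coefs = np.array(coefs)  # shape: (len(c_values), n_features)

# Number of features with a non-zero coefficient at each value of C
n_nonzero = (coefs != 0).sum(axis=1)
print(n_nonzero)
```

Plotting each column of `coefs` against `c_values` on a log-scaled x-axis shows which coefficients are driven to zero first as the regularisation becomes stronger.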
+
+
+We recommend exploring the API documentation for the *feature_extraction* module. Feature selection is a large and complex topic which can be approached many ways. Some additional resources worth checking out are:
+
+* https://scikit-learn.org/stable/modules/feature_selection.html
+* https://www.coursera.org/projects/machine-learning-feature-selection-in-python
+* https://academic.oup.com/bioinformatics/article/23/19/2507/185254
+
+
+
+
+
+
+
+
diff --git a/docs/source/8_reference.rst b/docs/source/8_reference.rst
@@ -16,9 +16,7 @@ API Reference
     api/CytoPy.data.subject
     api/CytoPy.data.read_write
     api/CytoPy.data.supervised_classifier
-    api/CytoPy.flow.clustering.consensus
-    api/CytoPy.flow.clustering.flowsom
-    api/CytoPy.flow.clustering.main
+    api/CytoPy.flow.clustering
     api/CytoPy.flow.variance
     api/CytoPy.flow.explore
     api/CytoPy.flow.neighbours

diff --git a/docs/source/9_license.rst b/docs/source/9_license.rst
@@ -0,0 +1,11 @@
+CytoPy License
+===============
+
+Copyright 2020 Ross Burton
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
diff --git a/docs/source/api/CytoPy.data.experiment.rst b/docs/source/api/CytoPy.data.experiment.rst
@@ -0,0 +1,7 @@
+CytoPy.data.experiment
+=======================
+
+.. automodule:: CytoPy.data.experiment
+    :members:
+    :inherited-members:
+    :show-inheritance:
diff --git a/docs/source/api/CytoPy.data.fcs.rst b/docs/source/api/CytoPy.data.fcs.rst
@@ -0,0 +1,8 @@
+CytoPy.data.fcs
+================
+
+.. automodule:: CytoPy.data.fcs
+    :members:
+    :inherited-members:
+    :show-inheritance:
+
diff --git a/docs/source/api/CytoPy.data.gate.rst b/docs/source/api/CytoPy.data.gate.rst
@@ -0,0 +1,7 @@
+CytoPy.data.gate
+=================
+
+.. automodule:: CytoPy.data.gate
+    :members:
+    :inherited-members:
+    :show-inheritance:
diff --git a/docs/source/api/CytoPy.data.gating_strategy.rst b/docs/source/api/CytoPy.data.gating_strategy.rst
@@ -0,0 +1,7 @@
+CytoPy.data.gating_strategy
+============================
+
+.. automodule:: CytoPy.data.gating_strategy
+    :members:
+    :inherited-members:
+    :show-inheritance:
diff --git a/docs/source/api/CytoPy.data.geometry.rst b/docs/source/api/CytoPy.data.geometry.rst
@@ -0,0 +1,7 @@
+CytoPy.data.geometry
+=====================
+
+.. automodule:: CytoPy.data.geometry
+    :members:
+    :inherited-members:
+    :show-inheritance:
diff --git a/docs/source/api/CytoPy.data.mapping.rst b/docs/source/api/CytoPy.data.mapping.rst
@@ -0,0 +1,7 @@
+CytoPy.data.mapping
+====================
+
+.. automodule:: CytoPy.data.mapping
+    :members:
+    :inherited-members:
+    :show-inheritance:
diff --git a/docs/source/api/CytoPy.data.population.rst b/docs/source/api/CytoPy.data.population.rst
@@ -0,0 +1,7 @@
+CytoPy.data.population
+=======================
+
+.. automodule:: CytoPy.data.population
+    :members:
+    :inherited-members:
+    :show-inheritance:
diff --git a/docs/source/api/CytoPy.data.project.rst b/docs/source/api/CytoPy.data.project.rst
@@ -0,0 +1,7 @@
+CytoPy.data.project
+====================
+
+.. automodule:: CytoPy.data.project
+    :members:
+    :inherited-members:
+    :show-inheritance:
diff --git a/docs/source/api/CytoPy.data.read_write.rst b/docs/source/api/CytoPy.data.read_write.rst
@@ -0,0 +1,7 @@
+CytoPy.data.read_write
+=======================
+
+.. automodule:: CytoPy.data.read_write
+    :members:
+    :inherited-members:
+    :show-inheritance:
diff --git a/docs/source/api/CytoPy.data.setup.rst b/docs/source/api/CytoPy.data.setup.rst
@@ -0,0 +1,7 @@
+CytoPy.data.setup
+==================
+
+.. automodule:: CytoPy.data.setup
+    :members:
+    :inherited-members:
+    :show-inheritance:
diff --git a/docs/source/api/CytoPy.data.subject.rst b/docs/source/api/CytoPy.data.subject.rst
@@ -0,0 +1,7 @@
+CytoPy.data.subject
+====================
+
+.. automodule:: CytoPy.data.subject
+    :members:
+    :inherited-members:
+    :show-inheritance:
diff --git a/docs/source/api/CytoPy.data.supervised_classifier.rst b/docs/source/api/CytoPy.data.supervised_classifier.rst
@@ -0,0 +1,7 @@
+CytoPy.data.supervised_classifier
+==================================
+
+.. automodule:: CytoPy.data.supervised_classifier
+    :members:
+    :inherited-members:
+    :show-inheritance:
diff --git a/docs/source/api/CytoPy.feedback.rst b/docs/source/api/CytoPy.feedback.rst
@@ -0,0 +1,7 @@
+CytoPy.feedback
+================
+
+.. automodule:: CytoPy.feedback
+    :members:
+    :inherited-members:
+    :show-inheritance:
diff --git a/docs/source/api/CytoPy.flow.clustering.rst b/docs/source/api/CytoPy.flow.clustering.rst
@@ -0,0 +1,17 @@
+CytoPy.flow.clustering
+=======================
+
+.. automodule:: CytoPy.flow.clustering.main
+    :members:
+    :inherited-members:
+    :show-inheritance:
+
+.. automodule:: CytoPy.flow.clustering.flowsom
+    :members:
+    :inherited-members:
+    :show-inheritance:
+
+.. automodule:: CytoPy.flow.clustering.consensus
+    :members:
+    :inherited-members:
+    :show-inheritance:
diff --git a/docs/source/api/CytoPy.flow.dim_reduction.rst b/docs/source/api/CytoPy.flow.dim_reduction.rst
@@ -0,0 +1,8 @@
+CytoPy.flow.dim_reduction
+==========================
+
+
+.. automodule:: CytoPy.flow.dim_reduction
+    :members:
+    :inherited-members:
+    :show-inheritance:
diff --git a/docs/source/api/CytoPy.flow.explore.rst b/docs/source/api/CytoPy.flow.explore.rst
@@ -0,0 +1,8 @@
+CytoPy.flow.explore
+====================
+
+
+.. automodule:: CytoPy.flow.explore
+    :members:
+    :inherited-members:
+    :show-inheritance:
diff --git a/docs/source/api/CytoPy.flow.feature_extraction.rst b/docs/source/api/CytoPy.flow.feature_extraction.rst
@@ -0,0 +1,8 @@
+CytoPy.flow.feature_extraction
+===============================
+
+
+.. automodule:: CytoPy.flow.feature_extraction
+    :members:
+    :inherited-members:
+    :show-inheritance:
diff --git a/docs/source/api/CytoPy.flow.neighbours.rst b/docs/source/api/CytoPy.flow.neighbours.rst
@@ -0,0 +1,8 @@
+CytoPy.flow.neighbours
+=======================
+
+
+.. automodule:: CytoPy.flow.neighbours
+    :members:
+    :inherited-members:
+    :show-inheritance:
diff --git a/docs/source/api/CytoPy.flow.plotting.rst b/docs/source/api/CytoPy.flow.plotting.rst
@@ -0,0 +1,8 @@
+CytoPy.flow.plotting
+=====================
+
+
+.. automodule:: CytoPy.flow.plotting
+    :members:
+    :inherited-members:
+    :show-inheritance:
diff --git a/docs/source/api/CytoPy.flow.ref.rst b/docs/source/api/CytoPy.flow.ref.rst
@@ -0,0 +1,8 @@
+CytoPy.flow.ref
+================
+
+
+.. automodule:: CytoPy.flow.ref
+    :members:
+    :inherited-members:
+    :show-inheritance:
diff --git a/docs/source/api/CytoPy.flow.sampling.rst b/docs/source/api/CytoPy.flow.sampling.rst
@@ -0,0 +1,8 @@
+CytoPy.flow.sampling
+=====================
+
+
+.. automodule:: CytoPy.flow.sampling
+    :members:
+    :inherited-members:
+    :show-inheritance: