Added additional documentation for feature_extraction and updated readme
burtonrj committed Nov 9, 2020
1 parent 3726fde commit c02556a
Showing 32 changed files with 342 additions and 9 deletions.
Empty file added .Rhistory
Empty file.
45 changes: 45 additions & 0 deletions CytoPy/flow/feature_extraction.py
Original file line number Diff line number Diff line change
@@ -183,6 +183,33 @@ def cluster_statistics(experiment: Experiment,
meta_label: str or None = None,
tag: str or None = None,
include_subject_id: bool = True):
"""
Given an Experiment and the name of a Population known
to contain clusters from some high-dimensional clustering
algorithm, this function generates a dataframe of
statistics. Details include the number of events
within the cluster and what proportion of the total events
in the Population this number represents.
Parameters
----------
experiment: Experiment
population: str (optional)
If no population is provided, will search all
possible populations for clusters
meta_label: str (optional)
If given, will filter results to include only
those clusters with this meta ID
tag: str (optional)
If given, will filter results to include only
those clusters with this tag
include_subject_id: bool (default=True)
If True, includes a column for the subject ID in
the resulting dataframe
Returns
-------
Pandas.DataFrame
"""
all_cluster_data = list()
for sample_id in experiment.list_samples():
fg = experiment.get_sample(sample_id)
@@ -211,6 +238,24 @@ def sort_variance(summary: pd.DataFrame,
identifier_columns: list,
value_name: str = "summary_stat",
var_name: str = "population"):
"""
Given a dataframe generated by one of the many
functions in this module, sort that dataframe
by variance.
Parameters
----------
summary: Pandas.DataFrame
Dataframe of summary statistics
identifier_columns: list
Columns to use as identifier(s) e.g. sample_id
value_name: str (default="summary_stat")
var_name: str (default="population")
Returns
-------
Pandas.DataFrame
"""
x = summary.melt(id_vars=identifier_columns,
value_name=value_name,
var_name=var_name)
12 changes: 7 additions & 5 deletions README.md
@@ -15,12 +15,12 @@ cytometry data and a clinical/experimental endpoint, we wish to find what proper
are important for identifying a disease? What phenotypes are changing in response to a stimulus? etc).
The pipeline itself is centered around a MongoDB database, is built in the Python programming language,
and designed with a 'low code' API, greatly
simplifying cytometry analysis. We can break it all down into the following steps that can be completed within minimal
simplifying cytometry analysis. We can break it down into the following steps that can be completed with minimal
code required:

1. Data uploading
2. Pre-processing
3. Batch-effect analysis
3. Quantifying inter-sample variation and choosing training data
4. Supervised cell classification
5. High-dimensional clustering
6. Feature extraction, selection, and description
@@ -31,6 +31,8 @@ to train a classifier. Alternatively high-dimensional clustering (by PhenoGraph
cells in a completely unbiased fashion. CytoPy provides access to both methodologies as we observe
that both have benefits and failings.

CytoPy is algorithm agnostic and provides a general interface for accessing the tools provided by Scikit-Learn whilst following the terminology and signatures common to this library. If you would like to expand CytoPy and add additional methods for autonomous gating, supervised classification or high-dimensional clustering, please contact me at burtonrj@cardiff.ac.uk, raise an issue or make a pull request.

For more details we refer you to our pre-print <a href='https://www.biorxiv.org/content/10.1101/2020.04.08.031898v2'>manuscript</a> and software documentation. Our documentation contains
detailed tutorials for each of the above steps (https://cytopy.readthedocs.io/).

@@ -45,9 +47,7 @@ For installing MongoDB the reader should refer to https://docs.mongodb.com/manua
CytoPy assumes that the installation is local but if a remote MongoDB database is used then a host address, port and
authentication parameters can be provided when connecting to the database, which is handled by cytopy.data.mongo_setup.

For installing Python 3 we recommend the distribution provided on <a href='https://www.python.org/downloads/'>Python.org</a> but
alternatively <a href='https://www.anaconda.com/'>Anaconda</a> can be used. We suggest that CytoPy be installed within an isolated
programming environment and suggest the environment manager <a href='https://docs.python.org/3/tutorial/venv.html'>venv.</a>
For installing Python 3 we recommend <a href='https://www.anaconda.com/'>Anaconda</a>, which also provides a convenient environment manager. We suggest that you always keep CytoPy contained within its own programming environment, for example one created with <a href='https://docs.python.org/3/tutorial/venv.html'>venv</a>.

### Installing CytoPy

@@ -69,6 +69,8 @@ CytoPy is licensed under the MIT license from the Open Source Initiative. CytoPy

In future releases we are currently interested in the following:

* A graphical user interface to open CytoPy up to a wider audience

* Incorporating data transform/normalisation procedures that mitigate or 'remove' noise as a result of batch effect.
Methods of interest include <a href='https://arxiv.org/pdf/1610.04181.pdf'>MMD-ResNet</a>,
<a href='https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1764-6'>BERMUDA</a>,
81 changes: 81 additions & 0 deletions docs/source/7_features.rst
@@ -0,0 +1,81 @@
*************************************************
Feature extraction, selection, and summarisation
*************************************************

Once the biological samples of an experiment have been classified into phenotypically similar populations and/or clusters, we want to summarise these 'features' of the biological samples so that we can observe differences between clinical/experimental groups. CytoPy offers the feature extraction module for summarising the findings of an experiment and performing feature selection.

Summarising the proportion of cell populations/clusters
########################################################

There are multiple functions in this module for extracting and summarising the findings of an **Experiment**. The first is the *experiment_statistics* function. This takes an **Experiment** and returns a Pandas DataFrame of population statistics for each sample in the experiment::

    from CytoPy.flow.feature_extraction import experiment_statistics
    from CytoPy.data.project import Project

    # Load the project and one of its experiments from the database
    pd_project = Project.objects(project_id='Peritonitis').get()
    exp = pd_project.load_experiment('PD_N_PDMCs')

    # Population statistics for every sample in the experiment
    exp_stats = experiment_statistics(experiment=exp, include_subject_id=True)

The resulting dataframe will have a column for the subject ID, the sample ID (FileGroup ID), the population ID, and then statistics on the number of cells in that population and the proportion of cells compared to both the immediate parent and the root population.

If we want to then label this dataframe with some meta-data associated to our subjects e.g. disease status, we can use the *meta_labelling* function. We provide it with the dataframe we have just created and the name of some variable stored in our Subject documents, and it creates a new column for this variable in the dataframe::

    from CytoPy.flow.feature_extraction import meta_labelling

    exp_stats = meta_labelling(experiment=exp,
                               dataframe=exp_stats,
                               meta_label="peritonitis")

The dataframe will now have a column named "peritonitis" containing a boolean value as to whether the patient had peritonitis or not.
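
As a rough illustration of what this labelling step amounts to, the sketch below uses plain pandas rather than CytoPy itself. The column names (subject_id, population_id, n) and the subject_meta lookup are hypothetical stand-ins for the dataframe and the Subject documents; this is not the *meta_labelling* implementation, only the idea behind it:

```python
import pandas as pd

# Hypothetical summary dataframe; column names are assumptions for illustration
exp_stats = pd.DataFrame({
    "subject_id": ["s1", "s1", "s2", "s2"],
    "population_id": ["T cells", "B cells", "T cells", "B cells"],
    "n": [1200, 300, 900, 450],
})

# Hypothetical lookup standing in for a variable stored on each Subject document
subject_meta = {"s1": True, "s2": False}

# Add one new column by mapping every row's subject ID to its meta-data value
labelled = exp_stats.assign(peritonitis=exp_stats["subject_id"].map(subject_meta))
print(labelled)
```

Every row belonging to the same subject receives the same label, which is what allows populations to be compared between groups later on.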

We can generate a similar dataframe but instead look at the clustering analysis performed on a particular population. This is achieved with the *cluster_statistics* function::

    from CytoPy.flow.feature_extraction import cluster_statistics

    cluster_stats = cluster_statistics(experiment=exp,
                                       population="T cells")

This generates a similar dataframe as before but now each row is a cluster and additional columns are included such as population ID, cluster ID, meta label, and clustering tag. The meta label and tag can be specified as arguments to this function to filter the clusters you want. Also, *population* is optional, and if it is not provided then all populations from all FileGroups are parsed for existing clusters.


Dimensionality reduction
##########################

A rapid method for detecting whether there is a 'global' difference between two experimental or clinical groups is to apply dimensionality reduction and plot the data points coloured according to their group. In the example below we differentiate patients with and without acute peritonitis. The dataframe 'summary' contains the proportion of cell populations identified by XGBoost and the proportion of clusters from all our experiments combined. We can use any method from CytoPy.flow.dim_reduction for the dimensionality reduction and a scatter plot is returned with data points coloured according to some label (here it is whether a patient has peritonitis or not)::

    from CytoPy.flow.feature_extraction import dim_reduction

    dim_reduction(summary=summary, label='peritonitis', scale=True, method='PCA')

.. image:: images/features/pca.png
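
For readers who want to see the underlying idea outside of CytoPy, the following is a minimal sketch using Scikit-Learn directly. The 'summary' dataframe here is synthetic, the feature names are invented, and the code is an assumption about what such a step typically looks like, not the *dim_reduction* implementation:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for 'summary': 20 samples, 5 features, plus a boolean label
features = [f"feature_{i}" for i in range(5)]
summary = pd.DataFrame(rng.normal(size=(20, 5)), columns=features)
summary["peritonitis"] = [True] * 10 + [False] * 10

# Scale features to zero mean and unit variance, then project onto two components
scaled = StandardScaler().fit_transform(summary[features])
embedding = PCA(n_components=2, random_state=42).fit_transform(scaled)

# Keep the label alongside the components so points can be coloured by group
plot_df = pd.DataFrame(embedding, columns=["PC1", "PC2"])
plot_df["peritonitis"] = summary["peritonitis"].values
print(plot_df.head())
```

Plotting PC1 against PC2 with the label as the colour gives the same kind of figure as the one shown above.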

Feature selection
###################

A simple approach for eliminating redundant variables is ranking them by their variance. The summary dataframe produced by the functions previously discussed can be passed to *sort_variance*, which will return a sorted dataframe for convenience.
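
The idea of ranking by variance can be sketched with pandas alone. The dataframe and its column names below are made up for illustration, and the grouping step is an assumption about how such a ranking could be done rather than a copy of the CytoPy function:

```python
import pandas as pd

# Hypothetical summary: one row per sample, one column per population
summary = pd.DataFrame({
    "sample_id": ["a", "b", "c", "d"],
    "T cells": [0.40, 0.42, 0.41, 0.39],
    "B cells": [0.10, 0.30, 0.05, 0.25],
    "Monocytes": [0.20, 0.25, 0.15, 0.22],
})

# Reshape to long format: one row per (sample, population) pair
long_df = summary.melt(id_vars=["sample_id"],
                       value_name="summary_stat",
                       var_name="population")

# Variance of each population across samples, highest variance first
ranked = (long_df.groupby("population")["summary_stat"]
          .var()
          .sort_values(ascending=False)
          .reset_index(name="variance"))
print(ranked)
```

Populations near the bottom of the ranking barely change between samples and are therefore the first candidates for removal.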

This often isn't enough, however. If the number of features is large and we want to narrow down which are of most value for predicting some clinical or experimental endpoint, we can use L1 regularisation in a suitable linear model to do so. L1 regularisation, also known as 'lasso' regularisation, shrinks the coefficients of less important variables to zero, producing a sparser model. By varying the regularisation term and observing the coefficients of all our features, we can see which features shrink more rapidly than others. This serves as a helpful feature selection technique, giving us the variables important for predicting some clinical or experimental endpoint.

The feature extraction module contains a function for this called *l1_feature_selection*. This function takes the feature space: a dataframe of features in which each row is a different biological sample and a 'label' column holds the value to predict. We specify which features to include in our selection and the name of the label column. The function also takes a search space as a tuple, which is passed to Numpy.logspace to generate the range of values used as the L1 regularisation terms. The first value specifies the start of the range and the second the end, both as exponents of 10; the third value, *n*, is the number of values generated between the start and end on a log scale.
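
To make the search space concrete, this short sketch shows what a tuple such as (-2, 0, 50) expands to when unpacked into NumPy's logspace function. The variable names are illustrative only:

```python
import numpy as np

# (start exponent, end exponent, number of values), as in the tuple described above
search_space = (-2, 0, 50)

# Equivalent to np.logspace(-2, 0, 50): 50 values from 10**-2 to 10**0
values = np.logspace(*search_space)
print(values[0], values[-1], len(values))
```

Because the values are spaced evenly on a log scale, small regularisation terms are sampled as densely as large ones.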

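Finally we also provide the model to use. This must be a Scikit-Learn linear classifier that supports L1 regularisation and exposes the regularisation term through the argument 'C' (in Scikit-Learn, 'C' is the inverse of the regularisation strength, so smaller values regularise more strongly). If None is given then a linear support vector machine is used by default::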

    from CytoPy.flow.feature_extraction import l1_feature_selection

    l1_feature_selection(feature_space=summary,
                         features=features,
                         label='peritonitis',
                         scale=True,
                         search_space=(-2, 0, 50),
                         model=None)

.. image:: images/features/l1.png
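
The technique itself can be reproduced in a few lines of Scikit-Learn. The sketch below is an illustration on synthetic data, assuming an L1-penalised LinearSVC as the linear model; it is not the code behind *l1_feature_selection*, and the dataset, the grid of 'C' values and the zero threshold are all arbitrary choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic data: 10 features, of which only 3 are informative
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)

# Candidate values for C (smaller C means stronger L1 regularisation)
c_values = np.logspace(-2, 0, 10)

# Fit one L1-penalised linear SVM per value of C and record the coefficients
coefs = []
for c in c_values:
    model = LinearSVC(penalty="l1", loss="squared_hinge", dual=False,
                      C=c, max_iter=10000, random_state=0)
    model.fit(X, y)
    coefs.append(model.coef_[0])
coefs = np.array(coefs)  # shape: (number of C values, number of features)

# Count the features with a non-zero coefficient at each value of C
n_nonzero = (np.abs(coefs) > 1e-8).sum(axis=1)
for c, n in zip(c_values, n_nonzero):
    print(f"C={c:.3f}: {n} non-zero coefficients")
```

Features whose coefficients stay non-zero under the strongest regularisation are the ones the model relies on most, which is the pattern read off a plot like the one above.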


We recommend exploring the API documentation for the *feature_extraction* module. Feature selection is a large and complex topic which can be approached in many ways. Some additional resources worth checking out are:

* https://scikit-learn.org/stable/modules/feature_selection.html
* https://www.coursera.org/projects/machine-learning-feature-selection-in-python
* https://academic.oup.com/bioinformatics/article/23/19/2507/185254








4 changes: 1 addition & 3 deletions docs/source/8_reference.rst
@@ -16,9 +16,7 @@ API Reference
api/CytoPy.data.subject
api/CytoPy.data.read_write
api/CytoPy.data.supervised_classifier
api/CytoPy.flow.clustering.consensus
api/CytoPy.flow.clustering.flowsom
api/CytoPy.flow.clustering.main
api/CytoPy.flow.clustering
api/CytoPy.flow.variance
api/CytoPy.flow.explore
api/CytoPy.flow.neighbours
11 changes: 11 additions & 0 deletions docs/source/9_license.rst
@@ -0,0 +1,11 @@
CytoPy License
===============

Copyright 2020 Ross Burton

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

7 changes: 7 additions & 0 deletions docs/source/api/CytoPy.data.experiment.rst
@@ -0,0 +1,7 @@
CytoPy.data.experiment
=======================

.. automodule:: CytoPy.data.experiment
:members:
:inherited-members:
:show-inheritance:
8 changes: 8 additions & 0 deletions docs/source/api/CytoPy.data.fcs.rst
@@ -0,0 +1,8 @@
CytoPy.data.fcs
================

.. automodule:: CytoPy.data.fcs
:members:
:inherited-members:
:show-inheritance:

7 changes: 7 additions & 0 deletions docs/source/api/CytoPy.data.gate.rst
@@ -0,0 +1,7 @@
CytoPy.data.gate
=================

.. automodule:: CytoPy.data.gate
:members:
:inherited-members:
:show-inheritance:
7 changes: 7 additions & 0 deletions docs/source/api/CytoPy.data.gating_strategy.rst
@@ -0,0 +1,7 @@
CytoPy.data.gating_strategy
============================

.. automodule:: CytoPy.data.gating_strategy
:members:
:inherited-members:
:show-inheritance:
7 changes: 7 additions & 0 deletions docs/source/api/CytoPy.data.geometry.rst
@@ -0,0 +1,7 @@
CytoPy.data.geometry
=====================

.. automodule:: CytoPy.data.geometry
:members:
:inherited-members:
:show-inheritance:
7 changes: 7 additions & 0 deletions docs/source/api/CytoPy.data.mapping.rst
@@ -0,0 +1,7 @@
CytoPy.data.mapping
====================

.. automodule:: CytoPy.data.mapping
:members:
:inherited-members:
:show-inheritance:
7 changes: 7 additions & 0 deletions docs/source/api/CytoPy.data.population.rst
@@ -0,0 +1,7 @@
CytoPy.data.population
=======================

.. automodule:: CytoPy.data.population
:members:
:inherited-members:
:show-inheritance:
7 changes: 7 additions & 0 deletions docs/source/api/CytoPy.data.project.rst
@@ -0,0 +1,7 @@
CytoPy.data.project
====================

.. automodule:: CytoPy.data.project
:members:
:inherited-members:
:show-inheritance:
7 changes: 7 additions & 0 deletions docs/source/api/CytoPy.data.read_write.rst
@@ -0,0 +1,7 @@
CytoPy.data.read_write
=======================

.. automodule:: CytoPy.data.read_write
:members:
:inherited-members:
:show-inheritance:
7 changes: 7 additions & 0 deletions docs/source/api/CytoPy.data.setup.rst
@@ -0,0 +1,7 @@
CytoPy.data.setup
==================

.. automodule:: CytoPy.data.setup
:members:
:inherited-members:
:show-inheritance:
7 changes: 7 additions & 0 deletions docs/source/api/CytoPy.data.subject.rst
@@ -0,0 +1,7 @@
CytoPy.data.subject
====================

.. automodule:: CytoPy.data.subject
:members:
:inherited-members:
:show-inheritance:
7 changes: 7 additions & 0 deletions docs/source/api/CytoPy.data.supervised_classifier.rst
@@ -0,0 +1,7 @@
CytoPy.data.supervised_classifier
==================================

.. automodule:: CytoPy.data.supervised_classifier
:members:
:inherited-members:
:show-inheritance:
7 changes: 7 additions & 0 deletions docs/source/api/CytoPy.feedback.rst
@@ -0,0 +1,7 @@
CytoPy.feedback
================

.. automodule:: CytoPy.feedback
:members:
:inherited-members:
:show-inheritance:
17 changes: 17 additions & 0 deletions docs/source/api/CytoPy.flow.clustering.rst
@@ -0,0 +1,17 @@
CytoPy.flow.clustering
=======================

.. automodule:: CytoPy.flow.clustering.main
:members:
:inherited-members:
:show-inheritance:

.. automodule:: CytoPy.flow.clustering.flowsom
:members:
:inherited-members:
:show-inheritance:

.. automodule:: CytoPy.flow.clustering.consensus
:members:
:inherited-members:
:show-inheritance:
8 changes: 8 additions & 0 deletions docs/source/api/CytoPy.flow.dim_reduction.rst
@@ -0,0 +1,8 @@
CytoPy.flow.dim_reduction
==========================


.. automodule:: CytoPy.flow.dim_reduction
:members:
:inherited-members:
:show-inheritance:
8 changes: 8 additions & 0 deletions docs/source/api/CytoPy.flow.explore.rst
@@ -0,0 +1,8 @@
CytoPy.flow.explore
====================


.. automodule:: CytoPy.flow.explore
:members:
:inherited-members:
:show-inheritance:
8 changes: 8 additions & 0 deletions docs/source/api/CytoPy.flow.feature_extraction.rst
@@ -0,0 +1,8 @@
CytoPy.flow.feature_extraction
===============================


.. automodule:: CytoPy.flow.feature_extraction
:members:
:inherited-members:
:show-inheritance:
8 changes: 8 additions & 0 deletions docs/source/api/CytoPy.flow.neighbours.rst
@@ -0,0 +1,8 @@
CytoPy.flow.neighbours
=======================


.. automodule:: CytoPy.flow.neighbours
:members:
:inherited-members:
:show-inheritance:
8 changes: 8 additions & 0 deletions docs/source/api/CytoPy.flow.plotting.rst
@@ -0,0 +1,8 @@
CytoPy.flow.plotting
=====================


.. automodule:: CytoPy.flow.plotting
:members:
:inherited-members:
:show-inheritance:
8 changes: 8 additions & 0 deletions docs/source/api/CytoPy.flow.ref.rst
@@ -0,0 +1,8 @@
CytoPy.flow.ref
================


.. automodule:: CytoPy.flow.ref
:members:
:inherited-members:
:show-inheritance:
8 changes: 8 additions & 0 deletions docs/source/api/CytoPy.flow.sampling.rst
@@ -0,0 +1,8 @@
CytoPy.flow.sampling
=====================


.. automodule:: CytoPy.flow.sampling
:members:
:inherited-members:
:show-inheritance: