This repository has been archived by the owner on Apr 19, 2023. It is now read-only.

Implement #221
Add logic to sc_h5ad_prepare_obs_filter.py
Add docs
dweemx committed Sep 16, 2020
1 parent 43d068c commit 83d6122
Showing 2 changed files with 48 additions and 24 deletions.
6 changes: 4 additions & 2 deletions docs/features.rst
@@ -272,12 +272,14 @@ For both methods, here are the mandatory params to set:
 - ``off`` should be set to ``h5ad``
 - ``method`` choose either ``internal`` or ``external``
 - ``filters`` is a List of Maps where each Map is required to have the following parameters:
+
   - ``id`` is a short identifier for the filter
   - ``valuesToKeepFromFilterColumn`` is an array of values from the ``filterColumnName`` that should be kept (other values will be filtered out).

 If ``internal`` is used, the following additional params are required:

 - ``filters`` is a List of Maps where each Map is required to have the following parameters:
+
   - ``sampleColumnName`` is the column name containing the sample ID/name information. It should exist in the ``obs`` column attribute of the h5ad.
   - ``filterColumnName`` is the column name that will be used to filter out cells. It should exist in the ``obs`` column attribute of the h5ad.

@@ -287,8 +289,8 @@ If ``external`` is used, the following additional params are required:

 - ``cellMetaDataFilePath`` is a file path pointing to a single TSV file (with header) with at least 3 columns: a column containing all the cell IDs, another containing the sample ID/name information, and a column to use for the filtering.
 - ``indexColumnName`` is the column name from ``cellMetaDataFilePath`` containing the cell IDs information. This column **must** have unique values.
-- ``sampleColumnName`` is the column name from ``cellMetaDataFilePath`` containing the sample ID/name information. Make sure that the values from this column match the sample IDs inferred from the data files. To know how those are inferred, please read the `Input Data Formats`_ section.
-- ``filterColumnName`` is the column name from ``cellMetaDataFilePath`` which will be used to filter out cells.
+- `optional` ``sampleColumnName`` is the column name from ``cellMetaDataFilePath`` containing the sample ID/name information. Make sure that the values from this column match the sample IDs inferred from the data files. To know how those are inferred, please read the `Input Data Formats`_ section.
+- `optional` ``filterColumnName`` is the column name from ``cellMetaDataFilePath`` which will be used to filter out cells.


 Multi-sample parameters
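
To make the ``external`` parameters concrete, here is a minimal sketch of how a cell metadata TSV of the documented shape could be loaded and filtered with pandas. The column names (``cell_id``, ``sample``, ``cell_type``) and the values are hypothetical and chosen only for illustration; they are not taken from the pipeline.

```python
import io

import pandas as pd

# Hypothetical TSV content with a header and three columns:
# cell IDs, sample IDs, and a column to filter on.
tsv = (
    "cell_id\tsample\tcell_type\n"
    "AAAC-1\tsampleA\tneuron\n"
    "AAAG-1\tsampleA\tglia\n"
    "AAAT-1\tsampleB\tneuron\n"
)

# indexColumnName would correspond to index_col here (assumed mapping).
metadata = pd.read_csv(io.StringIO(tsv), sep="\t", header=0, index_col="cell_id")

# The docs require the index column to hold unique values.
index_is_unique = metadata.index.is_unique

# valuesToKeepFromFilterColumn: keep only cells whose filter column value is listed.
values_to_keep = ["neuron"]
kept_ids = list(metadata.index[metadata["cell_type"].isin(values_to_keep)])
```

With a unique index and no ``sampleColumnName``, the filter column alone decides which cells are kept.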
66 changes: 44 additions & 22 deletions src/utils/bin/sc_h5ad_prepare_obs_filter.py
@@ -7,6 +7,20 @@
 import scanpy as sc
 import numpy as np

+
+def is_bool(s):
+    return s.lower() in ['true', 'false']
+
+
+def str_to_bool(s):
+    if s == 'True':
+        return True
+    elif s == 'False':
+        return False
+    else:
+        raise ValueError
+
+
 parser = argparse.ArgumentParser(description='')

 parser.add_argument(
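
Note that ``is_bool`` lower-cases its input while ``str_to_bool`` matches only the exact strings ``'True'`` and ``'False'``, so a value such as ``'true'`` passes ``is_bool`` and then raises ``ValueError``. A minimal sketch of a case-consistent variant follows; ``format_values`` is a hypothetical helper name that does not exist in the commit.

```python
def is_bool(s):
    # Case-insensitive check, as in the committed helper.
    return s.lower() in ("true", "false")


def str_to_bool(s):
    # Compare on the lower-cased string so the conversion accepts
    # exactly the inputs that is_bool accepts.
    lowered = s.lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise ValueError(f"Not a boolean string: {s!r}")


def format_values(values):
    # Hypothetical helper: convert boolean-like strings, leave the rest as-is.
    return [str_to_bool(v) if is_bool(v) else v for v in values]


formatted = format_values(["True", "false", "T-cell"])
```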
@@ -84,6 +98,10 @@
         raise Exception("VSN ERROR: Can only handle .h5ad files.")

 elif args.method == 'external':
+
+    if args.index_column_name is None:
+        raise Exception(f"VSN ERROR: Missing --index-column-name argument (indexColumnName param).")
+
     metadata = pd.read_csv(
         filepath_or_buffer=args.input,
         header=0,
@@ -93,34 +111,38 @@
 else:
     raise Exception(f"VSN ERROR: The given method {args.method} is invalid.")

-if args.sample_column_name not in metadata.columns:
-    raise Exception(f"VSN ERROR: The meta data .tsv file expects a header with a required '{args.sample_column_name}' column.")
-if args.filter_column_name not in metadata.columns:
-    raise Exception(f"VSN ERROR: The meta data .tsv file expects a header with a required '{args.filter_column_name}' column.")
+filter_mask = None
+values_to_keep_from_filter_column_formatted = None

+if args.filter_column_name in metadata.columns:
+    if args.values_to_keep_from_filter_column is None:
+        raise Exception(f"VSN ERROR: Missing --value-to-keep-from-filter-column argument (valuesToKeepFromFilterColumn param).")
+    # Convert to boolean type if needed
+    values_to_keep_from_filter_column_formatted = [
+        value_to_keep if not is_bool(s=value_to_keep) else str_to_bool(s=value_to_keep) for value_to_keep in args.values_to_keep_from_filter_column
+    ]

-def is_bool(s):
-    return s.lower() in ['true', 'false']
+if args.method == 'internal' or len(np.unique(metadata.index)) != len(metadata.index):

+    if args.sample_column_name not in metadata.columns:
+        raise Exception(f"VSN ERROR: Missing '{args.sample_column_name}' column in obs slot of the given h5ad input file.")

-def str_to_bool(s):
-    if s == 'True':
-        return True
-    elif s == 'False':
-        return False
-    else:
-        raise ValueError
+    if values_to_keep_from_filter_column_formatted is None:
+        raise Exception(f"VSN ERROR: Missing --filter-column-name argument (filterColumnName param) and/or --value-to-keep-from-filter-column (valuesToKeepFromFilterColumn param). These are required since the '{args.index_column_name}' index column does not contain unique values.")

+    print(f"Creating a filter mask based on '{args.sample_column_name}' and '{args.filter_column_name}'...")
+    filter_mask = np.logical_and(
+        metadata[args.sample_column_name] == args.sample_id,
+        metadata[args.filter_column_name].isin(values_to_keep_from_filter_column_formatted)
+    )
+else:
+    if values_to_keep_from_filter_column_formatted is not None:
+        print(f"Creating a filter mask based only on '{args.filter_column_name}' filter column...")
+        filter_mask = metadata[args.filter_column_name].isin(values_to_keep_from_filter_column_formatted)
+    else:
+        print(f"No filter mask: use all the cells from the given cell-based metadata .tsv filter file...")
+        filter_mask = np.in1d(metadata.index, metadata.index)

-# Convert to boolean type if needed
-values_to_keep_from_filter_column_formatted = [
-    value_to_keep if not is_bool(s=value_to_keep) else str_to_bool(s=value_to_keep) for value_to_keep in args.values_to_keep_from_filter_column
-]
-
-filter_mask = np.logical_and(
-    metadata[args.sample_column_name] == args.sample_id,
-    metadata[args.filter_column_name].isin(values_to_keep_from_filter_column_formatted)
-)
 cells_to_keep = metadata.index[filter_mask]

 # I/O
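
The new branching can be summarised as three cases: sample plus filter column (``internal`` method, or a non-unique index), filter column only, or no filtering at all. The sketch below restates that logic as a standalone function so the cases can be tried on a toy table. ``build_filter_mask`` and the toy column names are hypothetical and not part of the script, and the all-``True`` case uses ``np.ones`` instead of the script's ``np.in1d`` call.

```python
import numpy as np
import pandas as pd


def build_filter_mask(metadata, method, sample_id, sample_col, filter_col, values_to_keep):
    # Hypothetical restatement of the branching in the diff above.
    formatted = None
    if filter_col is not None and filter_col in metadata.columns:
        if values_to_keep is None:
            raise ValueError("Values to keep are required when a filter column is given.")
        formatted = list(values_to_keep)

    if method == "internal" or not metadata.index.is_unique:
        # Case 1: sample column and filter column are both required.
        if sample_col is None or sample_col not in metadata.columns:
            raise ValueError(f"Missing '{sample_col}' column.")
        if formatted is None:
            raise ValueError("Filter column and values to keep are required.")
        same_sample = (metadata[sample_col] == sample_id).to_numpy()
        kept_value = metadata[filter_col].isin(formatted).to_numpy()
        return same_sample & kept_value

    if formatted is not None:
        # Case 2: unique index, filter on the filter column only.
        return metadata[filter_col].isin(formatted).to_numpy()

    # Case 3: unique index and no filter column, keep every cell.
    return np.ones(len(metadata), dtype=bool)


# Toy metadata with a unique index (illustrative values only).
metadata = pd.DataFrame(
    {"sample": ["A", "A", "B"], "keep": ["yes", "no", "yes"]},
    index=["c1", "c2", "c3"],
)

internal_mask = build_filter_mask(metadata, "internal", "A", "sample", "keep", ["yes"])
external_mask = build_filter_mask(metadata, "external", "A", None, "keep", ["yes"])
all_mask = build_filter_mask(metadata, "external", "A", None, None, None)
```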