Support hi/lo gene search #1

tomwhite · 2018-10-19T11:18:48Z

So far we have a simple search where genes are on or off, and for this we can index the sparse gene expression matrix. To support a more sophisticated search where genes are hi/lo, we need the normalized matrix and a more elaborate indexing scheme.

E.g. referring to http://stm.sciencemag.org/content/10/461/eaau4711

We first defined a flow cytometric strategy to identify the known B cell subsets and plasma cells in intestinal mucosa and in circulation, identifying plasma cells as live CD45⁺CD38^hiCD27⁺ cells and nonplasma cell B cells as live CD45⁺CD38⁻CD19⁺ cells. Among nonplasma cell B cells, naïve B cells were defined as CD45⁺CD38⁻CD19⁺IgD⁺IgM⁺ cells, whereas switched memory (SM) B cells were defined as CD45⁺CD38⁻CD19⁺IgD⁻IgM⁻ cells (fig. S2).

So [CD45⁺CD38⁻CD19⁺] would be a possible search query.

tomwhite · 2018-10-19T11:39:24Z

Here's a first go at how we might implement this.

For the simple binary scheme already implemented, each cell has a genes field with the gene ids that are non-null in it. We then run a bool query whose must clause holds a match query for each gene id in the searched subset.

Now, as an example, if CD45 has gene id 45 and CD38 has gene id 38, then the query [CD45⁺ CD38⁺] is [genes=45 and genes=38] in ES search terms. Note that we can’t search for a low value such as CD38^-, since we can only search for the presence of a gene, not its absence.
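A minimal sketch of that binary query as a Python dict in the Elasticsearch query DSL. The field name `genes` and the gene ids 45 and 38 come from the example above; the function name `binary_gene_query` is hypothetical, and actually sending the body to a cluster is left out.

```python
def binary_gene_query(gene_ids):
    """Build a bool/must query: every listed gene id must be present in `genes`."""
    # One match clause per gene id; all of them must match (logical AND).
    must_clauses = [{"match": {"genes": str(gene_id)}} for gene_id in gene_ids]
    return {"query": {"bool": {"must": must_clauses}}}


# [CD45+ CD38+] with the illustrative ids 45 and 38
query_body = binary_gene_query([45, 38])
```

There is no clause here that can express "gene 38 is low", which is the limitation noted above.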

To support hi/lo and +/- we first have to define what these mean. Presumably they are conventions about percentiles (or standard deviations) in the normalized matrix. E.g. + means anything greater than zero (i.e. a positive sd), - means a negative sd, hi means >= 1 sd, and so on.

Then at indexing time, we have a single field genes again, but values are bucketed by the sd range they fall in. For example, if the value is between 0 and 1 sd, assign it to bucket +1; if it's between 1 and 2 sd, assign it to +2; and if it's over 2 sd, assign it to +3. Similarly for values less than 0. The bucket is then appended to the gene id, so 45+1 means the values that are between 0 and 1 sd for gene 45.
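The bucketing could be sketched as below. This assumes the normalized value is already expressed in sd units (a z-score), that a value of exactly 0 gets no token, and that the boundaries at 1 and 2 sd fall into the higher bucket so that "hi is >= 1 sd" maps onto buckets 2 and 3. The function names are hypothetical.

```python
import math


def bucket_token(gene_id, z):
    """Return a token such as '45+1' or '38-3', or None if the value has no bucket."""
    if math.isnan(z) or z == 0:
        return None  # assumption: exactly zero is neither + nor -
    sign = "+" if z > 0 else "-"
    magnitude = abs(z)
    if magnitude < 1:
        level = 1
    elif magnitude < 2:
        level = 2
    else:
        level = 3
    return f"{gene_id}{sign}{level}"


def genes_field(z_by_gene):
    """Build the `genes` field for one cell from a {gene_id: z_score} mapping."""
    tokens = (bucket_token(gene_id, z) for gene_id, z in z_by_gene.items())
    return [token for token in tokens if token is not None]


cell_genes = genes_field({45: 0.4, 38: -2.5})
```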

Then the query [CD45⁺ CD38^-] becomes [(genes=45+1 or genes=45+2 or genes=45+3) and (genes=38-1 or genes=38-2 or genes=38-3)] in ES search terms.

Similarly, [CD45⁺ CD38^hi] becomes [(genes=45+1 or genes=45+2 or genes=45+3) and (genes=38+2 or genes=38+3)].
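A rough sketch of how such queries might be generated. It swaps in a `terms` clause for each "or" group inside a bool `must`, which is one way to express the expansions above rather than a settled design. It also assumes `genes` is indexed so that tokens like `45+1` are kept whole (for example as a keyword field); with a default text analyzer the `+` and `-` would likely be split off. The names `MARKER_BUCKETS` and `hilo_query` are hypothetical, and `lo` is left out because its bucket range is not pinned down here.

```python
# Which buckets each marker expands to, following the examples above.
MARKER_BUCKETS = {
    "+": ["+1", "+2", "+3"],
    "-": ["-1", "-2", "-3"],
    "hi": ["+2", "+3"],
}


def hilo_query(criteria):
    """criteria: list of (gene_id, marker) pairs, e.g. [(45, '+'), (38, 'hi')]."""
    must_clauses = []
    for gene_id, marker in criteria:
        # OR over the buckets allowed for this gene.
        tokens = [f"{gene_id}{bucket}" for bucket in MARKER_BUCKETS[marker]]
        must_clauses.append({"terms": {"genes": tokens}})
    # AND across genes.
    return {"query": {"bool": {"must": must_clauses}}}


# [CD45+ CD38^hi] with the illustrative ids 45 and 38
query_body = hilo_query([(45, "+"), (38, "hi")])
```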
