Support hi/lo gene search #1

Open
tomwhite opened this issue Oct 19, 2018 · 1 comment

So far we have a simple search where genes are either on or off, and for this we can index the sparse gene expression matrix. To support more sophisticated searches where genes are hi/lo, we need the normalized matrix and a more elaborate indexing scheme.

E.g., referring to http://stm.sciencemag.org/content/10/461/eaau4711:

We first defined a flow cytometric strategy to identify the known B cell subsets and plasma cells in intestinal mucosa and in circulation, identifying plasma cells as live CD45+ CD38hi CD27+ cells and nonplasma cell B cells as live CD45+ CD38- CD19+ cells. Among nonplasma cell B cells, naïve B cells were defined as CD45+ CD38- CD19+ IgD+ IgM+ cells, whereas switched memory (SM) B cells were defined as CD45+ CD38- CD19+ IgD- IgM- cells (fig. S2).

So [CD45+ CD38- CD19+] would be a possible search query.

tomwhite commented Oct 19, 2018

Here's a first go at how we might implement this.

For the simple binary scheme already implemented, each cell has a genes field holding the ids of the genes that are non-null in it. We then run a bool query whose must clause contains a match query for each gene id in the searched subset.

As an example, if CD45 has gene id 45 and CD38 has gene id 38, then the query [CD45+ CD38+] is [genes=45 and genes=38] in ES search terms. Note that we can't search for a low value such as CD38-, since we can only search for the presence of a gene, not its absence.
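A minimal sketch of the request body for the binary scheme, written in Python. The field name genes comes from the description above; the helper name binary_query and the choice to pass gene ids as strings are assumptions for illustration, and the sketch only builds the body without sending it to a cluster.

```python
def binary_query(gene_ids):
    """Build an ES bool query that requires every given gene id to be present.

    Hypothetical helper: one match clause per gene id, all under `must`.
    """
    return {
        "query": {
            "bool": {
                "must": [{"match": {"genes": str(g)}} for g in gene_ids]
            }
        }
    }


# [CD45+ CD38+] with the example ids 45 and 38
query = binary_query([45, 38])
```

Every clause under must has to match, which gives the [genes=45 and genes=38] semantics.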

To support hi/lo and +/-, we first have to define what these mean. Presumably they are conventions about percentiles (or standard deviations) in the normalized matrix. E.g., + means anything greater than zero (i.e. a positive number of sd), - means a negative number of sd, hi means >= 1 sd, and so on.

Then at indexing time we again have a single genes field, but values are bucketed by the sd range they fall in. For example, if the value is between 0 and 1 sd, assign it to bucket +1; if it's between 1 and 2 sd, assign it to +2; and if it's over 2 sd, assign it to +3. Values less than 0 are handled similarly with buckets -1, -2 and -3. The bucket is then appended to the gene id, so 45+1 marks a cell whose value for gene 45 is between 0 and 1 sd.
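The bucketing step could look roughly like this. It is a sketch under stated assumptions: the function name bucket_token is hypothetical, the input is taken to be a z-score (value expressed in sd), the ranges are treated as half-open so that exactly 1 sd lands in +2 (matching hi being >= 1 sd), and a value of exactly 0 produces no token since it is neither + nor -.

```python
import math


def bucket_token(gene_id, z, max_bucket=3):
    """Return a token such as '45+1' for gene_id at z standard deviations.

    Assumptions: half-open ranges [0, 1), [1, 2), [2, inf) on each side
    of zero, and no token at all for z == 0.
    """
    if z == 0:
        return None
    # 0 < |z| < 1 -> 1, 1 <= |z| < 2 -> 2, |z| >= 2 -> 3 (capped)
    magnitude = min(math.floor(abs(z)) + 1, max_bucket)
    sign = "+" if z > 0 else "-"
    return f"{gene_id}{sign}{magnitude}"


tokens = [bucket_token(45, 0.5), bucket_token(38, -1.4)]
```

Tokens like 45+1 contain punctuation, so the genes field would probably need to be indexed as exact (keyword-style) values rather than analyzed text; otherwise an analyzer may split or drop the + and -.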

Then the query [CD45+ CD38-] becomes [(genes=45+1 or genes=45+2 or genes=45+3) and (genes=38-1 or genes=38-2 or genes=38-3)] in ES search terms.

Similarly, [CD45+ CD38hi] becomes [(genes=45+1 or genes=45+2 or genes=45+3) and (genes=38+2 or genes=38+3)].
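One way to sketch this translation in Python is to expand each level into its buckets and emit one terms clause per gene, since a terms query matches a document containing any of the listed values. The names LEVEL_BUCKETS and hilo_query are hypothetical, and the lo entry is an assumption that simply mirrors hi (<= -1 sd), which the discussion above does not spell out.

```python
# Which buckets each level expands to, using three buckets per side.
LEVEL_BUCKETS = {
    "+": ["+1", "+2", "+3"],
    "-": ["-1", "-2", "-3"],
    "hi": ["+2", "+3"],
    "lo": ["-2", "-3"],  # assumption: mirror image of "hi"
}


def hilo_query(conditions):
    """Build an ES bool query from (gene_id, level) pairs.

    Each pair becomes an OR over bucket tokens (a terms clause);
    the pairs are ANDed together under `must`.
    """
    must = [
        {"terms": {"genes": [f"{gene_id}{b}" for b in LEVEL_BUCKETS[level]]}}
        for gene_id, level in conditions
    ]
    return {"query": {"bool": {"must": must}}}


# [CD45+ CD38hi] with the example ids 45 and 38
query = hilo_query([(45, "+"), (38, "hi")])
```

Keeping the expansion in a table means a change to the definition of hi or lo only touches LEVEL_BUCKETS, not the indexed data, as long as the bucket boundaries stay the same.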
