Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add states() method to return actual genotype states as characters #2617

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

hyanwong
Copy link
Member

@hyanwong hyanwong commented Oct 30, 2022

Description

Adds variant.genotype_values(). See tskit-dev/tsinfer#739

PR Checklist:

  • Tests that fully cover new/changed functionality.
  • Documentation including tutorial content if appropriate.
  • Changelogs, if there are API changes.

@codecov
Copy link

codecov bot commented Oct 30, 2022

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.72%. Comparing base (d6b2303) to head (efa8d08).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2617   +/-   ##
=======================================
  Coverage   89.72%   89.72%           
=======================================
  Files          29       29           
  Lines       31578    31589   +11     
  Branches     6117     6121    +4     
=======================================
+ Hits        28333    28344   +11     
  Misses       1853     1853           
  Partials     1392     1392           
Flag Coverage Δ
c-tests 86.55% <ø> (ø)
lwt-tests 80.78% <ø> (ø)
python-c-tests 88.98% <ø> (ø)
python-tests 99.01% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
python/tskit/genotypes.py 99.15% <100.00%> (+0.08%) ⬆️

@hyanwong
Copy link
Member Author

I'm not sure why this is failing CI: everything is covered AFAICS.

Ready to merge, I reckon, pending approval of the name of the function.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - not sure about the name though.

python/tskit/genotypes.py Outdated Show resolved Hide resolved
@jeromekelleher
Copy link
Member

Also not clear that an Object array is what people would want. Would it be better to have a "missing data symbol" or something like we do for haplotypes?

@hyanwong
Copy link
Member Author

hyanwong commented Nov 1, 2022

Also not clear that an Object array is what people would want. Would it be better to have a "missing data symbol" or something like we do for haplotypes?

ISWYM, but using any particular character will mean that we can't use that for an allele type (which we might conceivably want to do). This is only going to bite for missing data, so I don't think it's too bad TBH.

@jeromekelleher
Copy link
Member

Hmm - how about writing some client code for this with and without missing data? Something like the verify function in tsinfer would be a good one. We definitely don't want to have to have different code paths for dtypes with and without missing data.

We use "N" as the default missing data character elsewhere and its fine. We should be able to use this function to compare easily with e.g. alignments.

Is the np.array(list(ts.alignments()) equal to np.array(var.states() for var in ts.variants()).T with and without missing data?

@benjeffery
Copy link
Member

@hyanwong Would you like this to be in 0.5.4?

@hyanwong
Copy link
Member Author

hyanwong commented Jan 4, 2023

Ah, good point. There's no hurry from my PoV. Is it just a question of my changing the name? I can do that easily enough.

@hyanwong hyanwong force-pushed the variant-strings branch 2 times, most recently from 9f178fe to 3b884b0 Compare January 7, 2023 18:17
@jeromekelleher
Copy link
Member

I don't think we should hurry this in as we don't want to get stuck with Object arrays unless we have to. They have subtly different semantics to arrays of unicode.

We should have some example client code for this before committing ourselves to a particular representation.

@hyanwong
Copy link
Member Author

hyanwong commented Jan 8, 2023

Sure. I have updated this so it should never return an object array anyway. But no hurry to get this in. It was to help @szhan and others who spent a fair bit of time comparing integer arrays between tree sequence before realising that they weren't actually comparable.

@hyanwong hyanwong changed the title Add genotype_values() method Add states() method to return actual genotype states as characters Jan 24, 2023
@hyanwong
Copy link
Member Author

hyanwong commented Sep 24, 2024

I think this also fixes #2181 . The name was changed to .states() after discussion with @jeromekelleher, but in #2181 he also suggested genotypes_as_alleles?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants