Fst input format #336

Open
NTNguyen13 opened this issue Aug 19, 2020 · 1 comment

NTNguyen13 commented Aug 19, 2020

Hi, I'm currently calculating Fst using the scikit-allel module.

I tried to use a pandas DataFrame of shape (n_variants, n_samples), where each cell is a list holding one genotype, like this:

site                  HG00096  HG00097  HG00099  HG00100  HG00101
1:20915411-["G","A"]  [0, 0]   [0, 0]   [0, 0]   [0, 0]   [0, 0]
1:20915418-["C","G"]  [0, 0]   [0, 0]   [1, 0]   [0, 1]   [1, 0]
1:20915441-["G","A"]  [0, 0]   [0, 0]   [1, 0]   [0, 0]   [0, 0]

but scikit-allel does not accept this format and returns the error:

a, b, c = allel.weir_cockerham_fst(test2, subpops)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Untitled-2 in 
----> 39 a, b, c = allel.weir_cockerham_fst(test2, subpops)

~/anaconda3/envs/hail/lib/python3.7/site-packages/allel/stats/fst.py in weir_cockerham_fst(g, subpops, max_allele, blen)
    107         g = GenotypeArray(g, copy=False)
    108     if g.ndim != 3:
--> 109         raise ValueError('g must have three dimensions')
    110     if g.shape[2] != 2:
    111         raise NotImplementedError('only diploid genotypes are supported')

ValueError: g must have three dimensions

Please advise me on the right input format for this function. At the moment I extract the genotypes from VCF files, split each genotype on '|', and convert the alleles to int. I wonder whether there is a native method to read a VCF into a genotype array.
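
For reference, the traceback above shows that the function expects a 3-dimensional input, and the check right after it requires the last axis to have length 2 (diploid). A table whose cells are lists is only 2-dimensional from NumPy's point of view. The following is a minimal sketch of one way to unpack such a table, assuming every cell holds exactly two integer alleles; the toy DataFrame just mirrors the three sites shown above and is not the real data.

```python
import numpy as np
import pandas as pd

# Toy table mirroring the layout in the question: one row per variant,
# one column per sample, each cell a [allele1, allele2] list.
df = pd.DataFrame(
    {
        "HG00096": [[0, 0], [0, 0], [0, 0]],
        "HG00097": [[0, 0], [0, 0], [0, 0]],
        "HG00099": [[0, 0], [1, 0], [1, 0]],
        "HG00100": [[0, 0], [0, 1], [0, 0]],
        "HG00101": [[0, 0], [1, 0], [0, 0]],
    },
    index=[
        '1:20915411-["G","A"]',
        '1:20915418-["C","G"]',
        '1:20915441-["G","A"]',
    ],
)

# The object-dtype table is 2-D; converting it to nested Python lists and
# back to an array yields a dense integer array of shape
# (n_variants, n_samples, ploidy).
g = np.array(df.to_numpy().tolist(), dtype="i1")

print(g.ndim, g.shape)
```

An array built this way has the (n_variants, n_samples, 2) shape that the dimension checks in the traceback look for, so it should be usable as the first argument to `allel.weir_cockerham_fst`, with `subpops` given as lists of sample column indices.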

@hardingnj
Collaborator

Hi @NTNguyen13

You are looking for: https://scikit-allel.readthedocs.io/en/stable/io.html

If your VCF contains multiple chromosomes, you will need to supply the region=?? argument to select a single chromosome.
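
As a rough sketch of what that page describes, the snippet below reads genotypes with `allel.read_vcf` and passes them to `allel.weir_cockerham_fst`. The helper names `subpop_indices` and `fst_from_vcf`, and the idea of describing subpopulations as lists of sample names, are assumptions made for illustration and are not part of scikit-allel. As far as I know, the `region` argument also needs a bgzipped, tabix-indexed VCF and the `tabix` executable on the path.

```python
def subpop_indices(samples, groups):
    """Map lists of sample names to lists of column indices (hypothetical helper)."""
    position = {name: i for i, name in enumerate(samples)}
    return [[position[name] for name in names] for names in groups]


def fst_from_vcf(vcf_path, groups, region=None):
    """Estimate Weir & Cockerham Fst for the named sample groups (illustrative only)."""
    # Imported here so that subpop_indices can be used without scikit-allel installed.
    import allel

    # Read only the fields needed: sample names and genotype calls.
    callset = allel.read_vcf(
        vcf_path, fields=["samples", "calldata/GT"], region=region
    )

    # calldata/GT already has shape (n_variants, n_samples, ploidy).
    g = allel.GenotypeArray(callset["calldata/GT"])
    subpops = subpop_indices(list(callset["samples"]), groups)

    # a, b, c are the per-variant, per-allele variance components.
    a, b, c = allel.weir_cockerham_fst(g, subpops)

    # Ratio of sums gives a single estimate over all variants read.
    return a.sum() / (a.sum() + b.sum() + c.sum())
```

A call might look like `fst_from_vcf("example.vcf.gz", [["HG00096", "HG00097"], ["HG00099", "HG00100", "HG00101"]], region="1")`, where the file name and the grouping of samples are placeholders.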

Thanks for your question, but generally, this issue tracker is for potential bugs. User queries should be addressed to https://groups.google.com/g/scikit-allel
