Polymorphic Gene Regions #790

Open
kbergin opened this issue Sep 3, 2019 · 0 comments

From Dario Strbenac on HCA Zendesk (dstr7320@uni.sydney.edu.au):

I am testing the data portal and noticed a couple of issues regarding polymorphic genes. RNA-seq reads for these genes are processed with the same algorithms as reads for all other genes. This is unsuitable because of how many variants exist in the human population and how similar these genes are to each other. For example, in recent work we found that patients who did not express HLA-G according to laboratory experiments still had high HLA-G counts by RNA-seq. Upon further investigation, we realised that the reads mapping to HLA-G had a mismatch score only 1 less than their mismatch score to HLA-A in the reference genome. The alternative approach we implemented is:

  1. Replace the hg38 sequence at the HLA and KIR gene loci with N so that reads cannot map there.
  2. Use an RNA-seq aligner to map the reads to the modified reference genome and write the unmapped reads to a separate FASTQ file.
  3. Take the unmapped reads and the IMGT HLA database (which contains thousands of alleles for each gene) and use RSEM to determine where the reads should really go.
  4. Use the reads mapped to the masked reference to process all other genes (i.e. the non-polymorphic ones).
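
As a rough illustration of step 1 only, masking could look something like the sketch below. It is not the reporter's actual implementation: the function name `mask_regions`, the toy contig, and the intervals are all made up for illustration, and real HLA/KIR coordinates would have to come from an hg38 annotation. Intervals are assumed to be 0-based and half-open, as in BED files.

```python
def mask_regions(sequence, regions):
    """Return a copy of `sequence` with each (start, end) interval replaced by 'N'.

    Intervals are assumed to be 0-based and half-open, as in BED files.
    """
    masked = list(sequence)
    for start, end in regions:
        if not 0 <= start <= end <= len(sequence):
            raise ValueError(f"interval ({start}, {end}) is outside the sequence")
        # Overwrite the interval with the same number of N bases,
        # so downstream coordinates do not shift.
        masked[start:end] = "N" * (end - start)
    return "".join(masked)


# Toy contig and made-up intervals, for illustration only.
toy_contig = "ACGTACGTACGTACGTACGT"
toy_regions = [(4, 8), (12, 15)]
masked_contig = mask_regions(toy_contig, toy_regions)
print(masked_contig)
```

Masking in place rather than deleting the loci keeps the coordinates of every other gene unchanged, so existing annotations would still line up with the modified reference.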

We found that with this approach the results matched the biologists' experimental results and avoided reference sequence bias. That bias is usually not a problem for most genes in the genome, which are highly conserved and do not have paralogs the way HLA and KIR genes do.
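
To make the near-tie described in the report concrete, here is a toy sketch of best-hit assignment by mismatch count. Everything in it is hypothetical: `gene_X` and `gene_Y` stand in for two similar paralogs, the sequences are invented (not real HLA sequence), and a plain Hamming distance stands in for whatever scoring a real aligner uses. The read is meant to represent an allele of `gene_Y` that differs from the reference copy of `gene_Y`.

```python
def mismatches(read, reference):
    """Count positions where two equal-length sequences differ (Hamming distance)."""
    if len(read) != len(reference):
        raise ValueError("read and reference must have the same length")
    return sum(a != b for a, b in zip(read, reference))


def best_hit(read, references):
    """Return (winning reference name, margin over the runner-up, all scores)."""
    if len(references) < 2:
        raise ValueError("need at least two references to compute a margin")
    scores = {name: mismatches(read, seq) for name, seq in references.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1])
    (winner, best), (_, second) = ranked[0], ranked[1]
    return winner, second - best, scores


# Invented reference copies of two similar genes (they differ at 3 positions).
toy_references = {
    "gene_X": "ACGTACGTAC",
    "gene_Y": "ACCTTCGTAA",
}
# Invented read from a gene_Y allele that is absent from the reference.
toy_read = "ACGTACGTAA"

winner, margin, scores = best_hit(toy_read, toy_references)
print(winner, margin, scores)
```

In this toy case the read is assigned to `gene_X` with a margin of a single mismatch, even though it was constructed as a `gene_Y` allele, which is the kind of reference bias the report describes.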

AC:

  • Determine if, when, why, and how we want to handle polymorphic gene regions.
  • TODO: Kylee to better understand the use case for doing this at all.

┆Issue is synchronized with this Jira Spike
