All variants are intergenic with NCBI GFF #1620

Open
dzc0104 opened this issue Feb 23, 2024 · 11 comments
@dzc0104

dzc0104 commented Feb 23, 2024

Hi,
I am attempting to annotate a customized VCF file using NCBI's GFF and FASTA (.fna) files for the Newcastle disease virus (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_004786615.1/). However, all the variants are being classified as intergenic, which is not what I see when I view them in IGV.

System

  • VEP version: 104.3
  • VEP Cache version: N/A
  • Perl version: N/A
  • OS: Linux
  • tabix installed

###Script
#To install the bgzip and tabix (I did it in my local terminal)
#Download htslib-1.19.1.tar.gz
tar -zxvf htslib-1.19.1.tar.gz
cd htslib-1.19.1

#removing header line of gff as vep does not work with files having header line (local terminal)
grep -v '^#' genomic.gff | sort -k1,1 -k4,4n -k5,5n -t$'\t' | bgzip > genomic.gff.gz
tabix -p gff genomic.gff.gz

#for compressing fasta file (local terminal and transfer all the files in super computer later)
bgzip -c GCF_004786615.1_ASM478661v1_genomic.fna > GCF_004786615.1_ASM478661v1_genomic.fna.gz
#for indexing fasta file
samtools faidx GCF_004786615.1_ASM478661v1_genomic.fna.gz

#creating a synonyms file that maps the chromosome names used in your VCF to those used in your GFF file
zcat iso1_filtered.snp.vcf.gz | grep -v '^#' | sort -k1,1 -o sorted_iso1.vcf
cut -f1 sorted_iso1.vcf > 1snpsynonyms.txt
zcat genomic.gff.gz | grep -v '^#' | sort -k1,1 -o sorted.gff

#variants annotation for snp using ASM4786615.1
vep -i iso1_filtered.snp.vcf.gz --gff /home/shared/hauck_research/Deepa_NDV_updated/troubleshooting/ncbiASM478661/ncbi_dataset/data/GCF_004786615.1/genomic.gff.gz --fasta /home/shared/hauck_research/Deepa_NDV_updated/troubleshooting/ncbiASM478661/ncbi_dataset/data/GCF_004786615.1/GCF_004786615.1_ASM478661v1_genomic.fna.gz --synonyms 1snpsynonyms.txt --species avian_orthoavulavirus

Full error message

I have not got any warning message as the script ran but the output file was with all intergenic variants.

Data files

A sample of the GFF after removing the header lines
NC_075404.1 RefSeq region 1 15186 . + . ID=NC_075404.1:1..15186;Dbxref=taxon:2560319;country=United Kingdom: N. Ireland;gbkey=Src;genome=genomic;isolate=chicken/N. Ireland/Ulster/67;mol_type=genomic RNA;old-name=Newcastle disease virus
NC_075404.1 RefSeq gene 56 1801 . + . ID=gene-QKC91_gp1;Dbxref=GeneID:80527638;Name=N;gbkey=Gene;gene=N;gene_biotype=protein_coding;locus_tag=QKC91_gp1
NC_075404.1 RefSeq CDS 122 1591 . + 0 ID=cds-YP_010790286.1;Parent=gene-QKC91_gp1;Dbxref=GenBank:YP_010790286.1,GeneID:80527638;Name=YP_010790286.1;gbkey=CDS;gene=N;locus_tag=QKC91_gp1;product=nucleoprotein;protein_id=YP_010790286.1
NC_075404.1 RefSeq gene 1804 3254 . + . ID=gene-QKC91_gp2;Dbxref=GeneID:80527633;Name=P;gbkey=Gene;gene=P;gene_biotype=protein_coding;locus_tag=QKC91_gp2
.....

A sample of the compressed VCF
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT iso1
NODE_1_length_6008_cov_909.877255 980 . T C 12078.64 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=0.924;DP=624;ExcessHet=0.0000;FS=1.120;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=19.87;ReadPosRankSum=0.149;SOR=0.728 GT:AD:DP:GQ:PL 0/1:236,372:608:99:12086,0,6929
NODE_1_length_6008_cov_909.877255 3666 . C T 15573.64 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=-0.079;DP=770;ExcessHet=0.0000;FS=7.765;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=20.88;ReadPosRankSum=0.795;SOR=0.362 GT:AD:DP:GQ:PL 0/1:235,511:746:99:15581,0,5829
NODE_1_length_6008_cov_909.877255 3812 . A G 534.64 ReadPosRankSum-8 AC=1;AF=0.500;AN=2;BaseQRankSum=1.096;DP=826;ExcessHet=0.0000;FS=15.515;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=0.66;ReadPosRankSum=-12.298;SOR=2.487 GT:AD:DP:GQ:PL 0/1:722,85:807:99:542,0,23105
NODE_1_length_6008_cov_909.877255 4631 . T C 1817.64 ReadPosRankSum-8 AC=1;AF=0.500;AN=2;BaseQRankSum=-3.725;DP=846;ExcessHet=0.0000;FS=22.208;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=2.24;ReadPosRankSum=-13.945;SOR=1.685 GT:AD:DP:GQ:PL 0/1:680,133:813:99:1825,0,21905
NODE_2_length_2668_cov_848.858356 289 . G A 924.64 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=-1.811;DP=720;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=59.97;MQRankSum=0.000;QD=1.50;ReadPosRankSum=-5.861;SOR=0.631 GT:AD:DP:GQ:PL 0/1:531,87:618:99:932,0,16256
.....

Synonyms text file format
NODE_1_length_6008_cov_909.877255 NC_075404.1
NODE_1_length_6008_cov_909.877255 NC_075404.1
NODE_1_length_6008_cov_909.877255 NC_075404.1
NODE_1_length_6008_cov_909.877255 NC_075404.1
NODE_2_length_2668_cov_848.858356 NC_075404.1
NODE_2_length_2668_cov_848.858356 NC_075404.1
.....
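
The listing above repeats the same pair once per variant. As a minimal sketch, assuming one row per distinct sequence name is sufficient, the file could be built as follows. The function name `unique_synonyms` is hypothetical, and the column order simply follows the listing above. This only deduplicates names; it does not check that coordinates on the two sequences are comparable.

```shell
# Hypothetical helper: read an uncompressed VCF on stdin and print one
# "VCF_name<TAB>annotation_name" row per distinct sequence name.
# $1 is the sequence name used in the GFF (e.g. NC_075404.1).
unique_synonyms() {
  grep -v '^#' | cut -f1 | sort -u |
    awk -v target="$1" 'BEGIN { FS = OFS = "\t" } { print $1, target }'
}

# Usage (file names as in the script above):
#   zcat iso1_filtered.snp.vcf.gz | unique_synonyms NC_075404.1 > 1snpsynonyms.txt
```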

VEP output

ENSEMBL VARIANT EFFECT PREDICTOR v104.3

Output produced at 2024-02-09 19:23:53

Using API version 104, DB version ?

ensembl-funcgen version 104.f1c7762

ensembl-io version 104.1d3bb6e

ensembl version 104.1af1dce

ensembl-variation version 104.20f5335

Column descriptions:

Uploaded_variation : Identifier of uploaded variant

Location : Location of variant in standard coordinate format (chr:start or chr:start-end)

Allele : The variant allele used to calculate the consequence

Gene : Stable ID of affected gene

Feature : Stable ID of feature

Feature_type : Type of feature - Transcript, RegulatoryFeature or MotifFeature

Consequence : Consequence type

cDNA_position : Relative position of base pair in cDNA sequence

CDS_position : Relative position of base pair in coding sequence

Protein_position : Relative position of amino acid in protein

Amino_acids : Reference and variant amino acids

Codons : Reference and variant codon sequence

Existing_variation : Identifier(s) of co-located known variants

Extra column keys:

IMPACT : Subjective impact classification of consequence type

DISTANCE : Shortest distance from variant to transcript

STRAND : Strand of the feature (1/-1)

FLAGS : Transcript quality flags

SOURCE : Source of transcript

genomic.gff.gz : /home/shared/hauck_research/Deepa_NDV_updated/troubleshooting/ncbiASM478661/ncbi_dataset/data/GCF_004786615.1/genomic.gff.gz (overlap)

#Uploaded_variation Location Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation Extra
NODE_1_length_6008_cov_909.877255_980_T/C NODE_1_length_6008_cov_909.877255:980 C - - - intergenic_variant - - - - - - IMPACT=MODIFIER
NODE_1_length_6008_cov_909.877255_3666_C/T NODE_1_length_6008_cov_909.877255:3666 T - - - intergenic_variant - - - - - - IMPACT=MODIFIER
NODE_1_length_6008_cov_909.877255_3812_A/G NODE_1_length_6008_cov_909.877255:3812 G - - - intergenic_variant - - - - - - IMPACT=MODIFIER
NODE_1_length_6008_cov_909.877255_4631_T/C NODE_1_length_6008_cov_909.877255:4631 C - - - intergenic_variant - - - - - - IMPACT=MODIFIER
....

@nuno-agostinho
Contributor

nuno-agostinho commented Feb 23, 2024

Hey @dzc0104,

Thank you for your question. The problem is related to using the NCBI GTF/GFF annotation for microorganisms: we currently require the GTF/GFF annotation to explicitly describe each transcript and its exons.

For your use case, you could use the following modified annotation:

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build ASM478661v1
#!genome-build-accession NCBI_Assembly:GCF_004786615.1
##sequence-region NC_075404.1 1 15186
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=2560319
NC_075404.1	RefSeq	region	1	15186	.	+	.	ID=NC_075404.1:1..15186;Dbxref=taxon:2560319;country=United Kingdom: N. Ireland;gbkey=Src;genome=genomic;isolate=chicken/N. Ireland/Ulster/67;mol_type=genomic RNA;old-name=Newcastle disease virus
NC_075404.1	RefSeq	gene	56	1801	.	+	.	ID=gene-QKC91_gp1;Dbxref=GeneID:80527638;Name=N;gbkey=Gene;gene=N;gene_biotype=protein_coding;locus_tag=QKC91_gp1
NC_075404.1	RefSeq	transcript	122	1591	.	+	0	ID=transcript-YP_010790286.1;Parent=gene-QKC91_gp1;Dbxref=GenBank:YP_010790286.1,GeneID:80527638;Name=YP_010790286.1;gbkey=CDS;gene=N;locus_tag=QKC91_gp1;product=nucleoprotein;protein_id=YP_010790286.1
NC_075404.1	RefSeq	exon	122	1591	.	+	0	ID=exon-YP_010790286.1;Parent=transcript-YP_010790286.1;Dbxref=GenBank:YP_010790286.1,GeneID:80527638;Name=YP_010790286.1;gbkey=CDS;gene=N;locus_tag=QKC91_gp1;product=nucleoprotein;protein_id=YP_010790286.1
NC_075404.1	RefSeq	gene	1804	3254	.	+	.	ID=gene-QKC91_gp2;Dbxref=GeneID:80527633;Name=P;gbkey=Gene;gene=P;gene_biotype=protein_coding;locus_tag=QKC91_gp2
NC_075404.1	RefSeq	transcript	1887	3074	.	+	0	ID=transcript-YP_010790287.1;Parent=gene-QKC91_gp2;Dbxref=GenBank:YP_010790287.1,GeneID:80527633;Name=YP_010790287.1;gbkey=CDS;gene=P;locus_tag=QKC91_gp2;product=phosphoprotein;protein_id=YP_010790287.1
NC_075404.1	RefSeq	exon	1887	3074	.	+	0	ID=exon-YP_010790287.1;Parent=transcript-YP_010790287.1;Dbxref=GenBank:YP_010790287.1,GeneID:80527633;Name=YP_010790287.1;gbkey=CDS;gene=P;locus_tag=QKC91_gp2;product=phosphoprotein;protein_id=YP_010790287.1
NC_075404.1	RefSeq	gene	3256	4496	.	+	.	ID=gene-QKC91_gp3;Dbxref=GeneID:80527634;Name=M;gbkey=Gene;gene=M;gene_biotype=protein_coding;locus_tag=QKC91_gp3
NC_075404.1	RefSeq	transcript	3290	4384	.	+	0	ID=transcript-YP_010790288.1;Parent=gene-QKC91_gp3;Dbxref=GenBank:YP_010790288.1,GeneID:80527634;Name=YP_010790288.1;gbkey=CDS;gene=M;locus_tag=QKC91_gp3;product=matrix protein;protein_id=YP_010790288.1
NC_075404.1	RefSeq	exon	3290	4384	.	+	0	ID=exon-YP_010790288.1;Parent=transcript-YP_010790288.1;Dbxref=GenBank:YP_010790288.1,GeneID:80527634;Name=YP_010790288.1;gbkey=CDS;gene=M;locus_tag=QKC91_gp3;product=matrix protein;protein_id=YP_010790288.1
NC_075404.1	RefSeq	gene	4498	6289	.	+	.	ID=gene-QKC91_gp4;Dbxref=GeneID:80527635;Name=F;gbkey=Gene;gene=F;gene_biotype=protein_coding;locus_tag=QKC91_gp4
NC_075404.1	RefSeq	transcript	4544	6205	.	+	0	ID=transcript-YP_010790289.1;Parent=gene-QKC91_gp4;Dbxref=GenBank:YP_010790289.1,GeneID:80527635;Name=YP_010790289.1;gbkey=CDS;gene=F;locus_tag=QKC91_gp4;product=fusion protein;protein_id=YP_010790289.1
NC_075404.1	RefSeq	exon	4544	6205	.	+	0	ID=exon-YP_010790289.1;Parent=transcript-YP_010790289.1;Dbxref=GenBank:YP_010790289.1,GeneID:80527635;Name=YP_010790289.1;gbkey=CDS;gene=F;locus_tag=QKC91_gp4;product=fusion protein;protein_id=YP_010790289.1
NC_075404.1	RefSeq	gene	6321	8322	.	+	.	ID=gene-QKC91_gp5;Dbxref=GeneID:80527636;Name=HN;gbkey=Gene;gene=HN;gene_biotype=protein_coding;locus_tag=QKC91_gp5
NC_075404.1	RefSeq	transcript	6412	8262	.	+	0	ID=transcript-YP_010790290.1;Parent=gene-QKC91_gp5;Dbxref=GenBank:YP_010790290.1,GeneID:80527636;Name=YP_010790290.1;gbkey=CDS;gene=HN;locus_tag=QKC91_gp5;product=hemagglutinin-neuraminidase;protein_id=YP_010790290.1
NC_075404.1	RefSeq	exon	6412	8262	.	+	0	ID=exon-YP_010790290.1;Parent=transcript-YP_010790290.1;Dbxref=GenBank:YP_010790290.1,GeneID:80527636;Name=YP_010790290.1;gbkey=CDS;gene=HN;locus_tag=QKC91_gp5;product=hemagglutinin-neuraminidase;protein_id=YP_010790290.1
NC_075404.1	RefSeq	gene	8370	15072	.	+	.	ID=gene-QKC91_gp6;Dbxref=GeneID:80527637;Name=L;gbkey=Gene;gene=L;gene_biotype=protein_coding;locus_tag=QKC91_gp6
NC_075404.1	RefSeq	transcript	8381	14995	.	+	0	ID=transcript-YP_010790291.1;Parent=gene-QKC91_gp6;Dbxref=GenBank:YP_010790291.1,GeneID:80527637;Name=YP_010790291.1;gbkey=CDS;gene=L;locus_tag=QKC91_gp6;product=RNA-dependent RNA polymerase;protein_id=YP_010790291.1
NC_075404.1	RefSeq	exon	8381	14995	.	+	0	ID=exon-YP_010790291.1;Parent=transcript-YP_010790291.1;Dbxref=GenBank:YP_010790291.1,GeneID:80527637;Name=YP_010790291.1;gbkey=CDS;gene=L;locus_tag=QKC91_gp6;product=RNA-dependent RNA polymerase;protein_id=YP_010790291.1

As this is not the first time we got this question (see #1074), I am going to talk with the team about the possibility of supporting these NCBI GTF/GFF annotation files for microorganisms. Maybe we can consider each CDS as a single-exon transcript. I will keep you updated on this.

Best regards,
Nuno

@dzc0104
Author

dzc0104 commented Feb 26, 2024

Thank you for the response @nuno-agostinho
It worked for that reference. I have a question: did you edit the GFF file manually? I have two other references: 1) https://www.ncbi.nlm.nih.gov/nuccore/NC_039223.1
2) https://www.ncbi.nlm.nih.gov/nuccore/AF077761 - this one has GFF3 files, and I tried to convert them into GFF and even GTF but could not. The GFF3 file could not even be bgzipped and tabix-indexed.

@nuno-agostinho
Contributor

nuno-agostinho commented Feb 27, 2024

Hi @dzc0104,

I manually created the file by basically:

  1. Duplicating the CDS lines
  2. Changing the feature to transcript and exon
  3. Changing their IDs to something unique
  4. Changing their Parent IDs:
    • Put the gene ID as the parent ID of the transcript
    • Put the transcript ID as the parent ID of the exon

Tell me if you need further instructions.
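
The manual steps above could be automated along these lines. This is only a sketch, not an official VEP tool: the function name `cds_to_transcript_exon` is hypothetical, and it assumes that every CDS line carries a `Parent=` attribute pointing at its gene and that each CDS occupies a single line (a CDS split over several lines would produce duplicate IDs). As in the modified annotation shown earlier, each CDS line is replaced by a transcript line and an exon line with the same coordinates.

```shell
# Hypothetical helper: read GFF3 on stdin and rewrite every CDS line as a
# transcript line plus an exon line; all other lines pass through unchanged.
cds_to_transcript_exon() {
  awk 'BEGIN { FS = OFS = "\t" }
    /^#/ || $3 != "CDS" { print; next }
    {
      n = split($9, attr, ";")
      id = ""; parent = ""; rest = ""
      for (i = 1; i <= n; i++) {
        # ID=cds-XYZ becomes the base name XYZ for the new IDs
        if (attr[i] ~ /^ID=/) { id = substr(attr[i], 4); sub(/^cds-/, "", id) }
        else if (attr[i] ~ /^Parent=/) parent = substr(attr[i], 8)
        else if (attr[i] != "") rest = rest ";" attr[i]
      }
      # Leave the line alone if it cannot be linked to a gene
      if (id == "" || parent == "") { print; next }
      # Transcript: unique ID, gene as parent
      $3 = "transcript"; $9 = "ID=transcript-" id ";Parent=" parent rest; print
      # Exon: unique ID, transcript as parent
      $3 = "exon"; $9 = "ID=exon-" id ";Parent=transcript-" id rest; print
    }'
}

# Usage (the output file name is arbitrary):
#   cds_to_transcript_exon < genomic.gff > genomic.modified.gff
```

The output still needs to be sorted, bgzipped and tabix-indexed before it is passed to `--gff`.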

> this one has gff3 files and I tried to convert it into gff and even gtf but could not. Gff3 did not even bgzipped and tabixed.

If you downloaded the GFF3 annotation via the Send to form in the top right corner of the record, you need to remove the last empty lines of the file before running bgzip and tabix. Tell me if this worked.
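
A rough sketch of that clean-up, combined with the sorting step already used in the original script. The function name `clean_and_sort_gff` is hypothetical, and the sketch removes every empty line (not only the trailing ones) together with the `#` header lines.

```shell
# Hypothetical helper: read GFF3 on stdin, drop comment lines and empty or
# whitespace-only lines, then sort by sequence name, start and end.
clean_and_sort_gff() {
  grep -v -e '^#' -e '^[[:space:]]*$' |
    sort -t "$(printf '\t')" -k1,1 -k4,4n -k5,5n
}

# Usage (assumes bgzip and tabix from htslib are on the PATH):
#   clean_and_sort_gff < sequence.gff3 | bgzip > sequence.gff3.gz
#   tabix -p gff sequence.gff3.gz
```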

Cheers,
Nuno

@dzc0104
Author

dzc0104 commented Mar 10, 2024

@nuno-agostinho Yay! It worked. Thank you very much, Nuno.

Regards,
Deepa

@dzc0104
Author

dzc0104 commented Mar 18, 2024

@nuno-agostinho I still have a question. How can position 77 be associated with multiple types of genes, namely F, M, NP, and P? During my analysis, I observed that genomic position 77 is annotated with gene symbols F, M, NP, and P across various transcripts like this
Iso7- Vep.xlsx

I got this information from a dataset https://www.ncbi.nlm.nih.gov/nuccore/AF077761 that includes details about gene symbols and transcript types. But I'm not sure what it means biologically to have different gene types at the same position.

@nuno-agostinho
Contributor

nuno-agostinho commented Mar 19, 2024

Hi @dzc0104,

The only results associated with genes F and M are upstream_gene_variant or downstream_gene_variant. Marking variants as upstream/downstream of a gene is useful for identifying variants that may affect that gene (for example, by falling in a regulatory region).

However, the default distance between a variant and a transcript used by VEP to annotate up/downstream variants is 5 000 bp (optimised for vertebrates), and the genome you mentioned is small (15 186 bp). Please try decreasing the --distance parameter to a value that makes more sense for your use case.
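
To illustrate the effect of the window size, here is a small sketch that is not part of VEP and only approximates the idea: it lists the features lying within a given distance of a position and ignores strand. The helper names `ndv_transcripts` and `features_within` are hypothetical, and the coordinates are the transcript coordinates of the modified NC_075404.1 annotation shown earlier in this thread, not those of AF077761.

```shell
# Transcript coordinates (name, start, end) copied from the modified
# NC_075404.1 annotation earlier in this thread.
ndv_transcripts() {
  printf 'N\t122\t1591\nP\t1887\t3074\nM\t3290\t4384\n'
  printf 'F\t4544\t6205\nHN\t6412\t8262\nL\t8381\t14995\n'
}

# Hypothetical helper: print "name<TAB>distance" for every feature whose
# span lies within DIST bp of position POS (distance 0 means overlap).
# Usage: features_within POS DIST < name_start_end.tsv
features_within() {
  awk -v pos="$1" -v dist="$2" 'BEGIN { FS = "\t" }
    {
      d = 0
      if (pos < $2) d = $2 - pos
      else if (pos > $3) d = pos - $3
      if (d <= dist) print $1 "\t" d
    }'
}

ndv_transcripts | features_within 77 5000   # N 45, P 1810, M 3213, F 4467
ndv_transcripts | features_within 77 100    # N 45
```

With a 5 000 bp window, position 77 is within reach of four transcripts on this annotation. With a smaller value (100 here is arbitrary), only the nearest one remains, which is the behaviour to expect from a lower `--distance`.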

Hope this makes it clear.

Cheers,
Nuno

@dzc0104
Author

dzc0104 commented May 22, 2024

Hi @nuno-agostinho,

Thank you for your assistance.

As part of my data analysis, I've identified synonymous variants and now I'm exploring their potential impacts at the amino acid level. While synonymous variants traditionally aren't thought to have functional impacts on protein structure, they can affect RNA stability, protein folding, evolutionary conservation, splicing regulation, and regulatory elements.

I've used Variant Effect Predictor (VEP) with the SIFT option (--sift b), but unfortunately, I didn't receive any SIFT data in the output. Does this indicate that there are no predictions available for my variants?

Here's the command I used:
vep -i iso1p1_filtered.snp.vcf.gz \
  --gff /home/shared/hauck_research/Deepa_NDV_updated/troubleshooting/ref/AF077761/sequence.gff3.gz \
  --fasta /home/shared/hauck_research/Deepa_NDV_updated/troubleshooting/ref/AF077761/AF077761.fasta.gz \
  --species avian_orthoavulavirus \
  --sift b

Additionally, I'm seeking recommendations for other tools to analyze the functional impacts of synonymous variants, particularly those focusing on RNA-level effects, splicing regulation, and non-protein-coding impacts.

Thank you for your guidance! 😊

I have attached the VCF file below.

iso1p1_filtered.snp.vcf.gz

Best regards,
Deepa

@nuno-agostinho
Contributor

Hi @dzc0104,

VEP only returns pre-computed SIFT results stored in Ensembl databases in --database or --cache modes. However, we don't have SIFT results for avian orthoavulavirus. You may want to consider installing and running SIFT on your data, as per https://sift.bii.a-star.edu.sg.

Regarding additional tools to help predict variant consequences, some articles list such tools:

Hope this information was useful.

Cheers,
Nuno

@Joshua-Macleod

Joshua-Macleod commented Aug 15, 2024

Hi @nuno-agostinho,

I have a similar issue to the one originally reported by @dzc0104, with variants being called as intergenic.

I've built .gff3 files using both prokka and bakta for reference genomes against which I'm looking to find variants. Here's an excerpt of a bakta .gff3 below:

contig00001     Prodigal        CDS     265     723     .       +       0       ID=KAHBKG_00010;Name=Transcriptional regulator CtsR;locus_tag=KAHBKG_00010;product=Transcriptional regulator CtsR;Dbxref=COG:COG4463,COG:K,RefSeq:WP_003760062.1,SO:0001217,UniParc:UPI00000CC18E,UniRef:UniRef100_H1GA27,UniRef:UniRef50_A0A143YMT3,UniRef:UniRef90_G2ZA06;gene=ctsR
contig00001     Prodigal        CDS     736     1254    .       +       0       ID=KAHBKG_00015;Name=Protein-arginine kinase activator protein McsA;locus_tag=KAHBKG_00015;product=Protein-arginine kinase activator protein McsA;Dbxref=COG:COG3880,COG:O,RefSeq:WP_003760064.1,SO:0001217,UniParc:UPI0001EB894E,UniRef:UniRef100_A0A823H5C3,UniRef:UniRef50_H1GA28,UniRef:UniRef90_H1GA28;gene=mcsA
contig00001     Prodigal        CDS     1251    2273    .       +       0       ID=KAHBKG_00020;Name=protein arginine kinase;locus_tag=KAHBKG_00020;product=protein arginine kinase;Dbxref=COG:COG3869,COG:O,EC:2.7.14.1,GO:0004111,GO:0004672,GO:0005524,GO:0016310,GO:0046314,RefSeq:WP_010990301.1,SO:0001217,UniParc:UPI000013952D,UniRef:UniRef100_Q92F44,UniRef:UniRef50_Q48759,UniRef:UniRef90_Q48759;gene=mcsB
contig00001     Prodigal        CDS     2302    4764    .       +       0       ID=KAHBKG_00025;Name=endopeptidase Clp ATP-binding chain C;locus_tag=KAHBKG_00025;product=endopeptidase Clp ATP-binding chain C;Dbxref=COG:COG0542,COG:O,RefSeq:WP_003770116.1,SO:0001217,UniParc:UPI00000CC190,UniRef:UniRef100_A0A3H2VSB6,UniRef:UniRef50_A0A0F7N4K2,UniRef:UniRef90_A0A097B1Z0,VFDB:VFC0282,VFDB:VFG000079;gene=clpC

I've tried to make use of your method here:

  1. Duplicating the CDS lines
  2. Changing the feature to transcript and exon
  3. Changing their IDs to something unique
  4. Changing their Parent IDs:
    • Put the gene ID as the parent ID of the transcript
    • Put the transcript ID as the parent ID of the exon

and even changing CDS to gene in the .gff3 file and including a biotype to remedy the warning (just on the off chance...):

contig00001     Prodigal        gene    265     723     .       +       .       ID=gene-KAHBKG_00010;locus_tag=KAHBKG_00010;gene_biotype=protein_coding
contig00001     Prodigal        transcript      265     723     .       +       .       ID=KAHBKG_00010_t1000;Parent=gene-KAHBKG_00010;locus_tag=KAHBKG_00010
contig00001     Prodigal        exon    265     723     .       +       0       ID=KAHBKG_00010_e1000;Parent=KAHBKG_00010_t1000;locus_tag=KAHBKG_00010

However, I still receive warnings (WARNING: Unable to determine biotype of KAHBKG_01390) for approx. 30 IDs/locus_tags per .gff3 and variants are still called as intergenic even if the locations fall within a CDS.

Any recommendations here, or if you'd like me to provide test data, do let me know.

Cheers,
Joshua

@nuno-agostinho
Contributor

nuno-agostinho commented Aug 16, 2024

Hi @Joshua-Macleod,

Based on that warning, I would say that those lines have no field indicating their biotype, so VEP can't determine whether they are part of a protein_coding transcript or not.

Could you show me the lines in your GFF3 file relative to KAHBKG_01390?
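
One way to find such lines is sketched below. The function name `transcripts_without_biotype` is hypothetical, and the check rests on an assumption: it looks for any attribute key ending in `biotype` (for example `biotype` or `gene_biotype`) on the transcript line itself. Whether VEP needs this on the transcript, on the gene, or on either is not established in this thread.

```shell
# Hypothetical helper: read GFF3 on stdin and print the ID of every
# transcript feature whose attributes contain no key ending in "biotype".
transcripts_without_biotype() {
  awk 'BEGIN { FS = "\t" }
    $3 == "transcript" && $9 !~ /(^|;)[A-Za-z_]*biotype=/ {
      n = split($9, attr, ";")
      for (i = 1; i <= n; i++)
        if (attr[i] ~ /^ID=/) print substr(attr[i], 4)
    }'
}

# Usage (the file name is a placeholder for an uncompressed GFF3 file):
#   transcripts_without_biotype < annotation.gff3
```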

Best,
Nuno

@Joshua-Macleod

Joshua-Macleod commented Aug 16, 2024

Hi @nuno-agostinho,

Thanks for getting back to me.

Here are the lines:

contig00001     Prodigal        gene    270089  271192  .       +       .       ID=gene-KAHBKG_01390;locus_tag=KAHBKG_01390;gene_biotype=protein_coding;Name=23S rRNA (adenine(2503)-C(2))-methyltransferase RlmN;product=23S rRNA (adenine(2503)-C(2))-methyltransferase RlmN;Dbxref=COG:COG0820,COG:J,EC:2.1.1.192,GO:0000049,GO:0002935,GO:0005737,GO:0008757,GO:0016433,GO:0019843,GO:0031167,GO:0046872,GO:0051539,GO:0070040,GO:0070475,RefSeq:WP_003725208.1,SO:0001217,UniParc:UPI00000CC251,UniRef:UniRef100_Q92EH6,UniRef:UniRef50_Q8Y9P2,UniRef:UniRef90_Q8Y9P2;gene=rlmN
contig00001     Prodigal        transcript      270089  271192  .       +       .       ID=KAHBKG_01390_t1272;Parent=gene-KAHBKG_01390;locus_tag=KAHBKG_01390;Name=23S rRNA (adenine(2503)-C(2))-methyltransferase RlmN;product=23S rRNA (adenine(2503)-C(2))-methyltransferase RlmN;Dbxref=COG:COG0820,COG:J,EC:2.1.1.192,GO:0000049,GO:0002935,GO:0005737,GO:0008757,GO:0016433,GO:0019843,GO:0031167,GO:0046872,GO:0051539,GO:0070040,GO:0070475,RefSeq:WP_003725208.1,SO:0001217,UniParc:UPI00000CC251,UniRef:UniRef100_Q92EH6,UniRef:UniRef50_Q8Y9P2,UniRef:UniRef90_Q8Y9P2;gene=rlmN
contig00001     Prodigal        exon    270089  271192  .       +       0       ID=KAHBKG_01390_e1272;Parent=KAHBKG_01390_t1272;locus_tag=KAHBKG_01390;Name=23S rRNA (adenine(2503)-C(2))-methyltransferase RlmN;product=23S rRNA (adenine(2503)-C(2))-methyltransferase RlmN;Dbxref=COG:COG0820,COG:J,EC:2.1.1.192,GO:0000049,GO:0002935,GO:0005737,GO:0008757,GO:0016433,GO:0019843,GO:0031167,GO:0046872,GO:0051539,GO:0070040,GO:0070475,RefSeq:WP_003725208.1,SO:0001217,UniParc:UPI00000CC251,UniRef:UniRef100_Q92EH6,UniRef:UniRef50_Q8Y9P2,UniRef:UniRef90_Q8Y9P2;gene=rlmN

Worth noting, these aren't loci outputted by vep (edit: presumably wouldn't be for the same reason they're noted in the warnings - I didn't put two and two together).

Cheers,
Joshua
