-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sam-xlate bed format /hg38 / gencode #4
Comments
Hi Darek, I also would like to know how to generate a sam-xlate compatible BED file from a GTF. Please, did you figure it out? Vitor |
Hi Vitor, I need to go through the old notes/emails to check the steps. I will keep you posted. Darek |
Hello Vitor, I have not found the working solution. But maybe we can figure it out. The ubu bed format is BED12 one line per transcript:
gtf2bed from bedops outputs one feature per line. And includes gene, transcript and exon. So the output must have transcript_id in the 4th column and the transcript exons data in:
Above is from the https://m.ensembl.org/info/website/upload/bed.html You may check this gist: but I have not tested it. DK |
Hi Darek, thank you for your insight. I have had no luck with the GTF -> BED12 conversion tools. Instead, I've created my own script, since the specifications that you described don't look very challenging. As far as I understood, I just need to take the GTF, (1) select the exon entries, (2) convert from 1-based to 0-based coordinates, (3) compute block sizes (exon lengths), and (4) compute block starts (start positions of each exon - start position of first exon), and append commas in the end for columns 11 and 12. Other columns (such as score, thickStart, thickEnd, itemRgb) seem to be all set to zero in the example BED, and are apparently not relevant for sam-xlate. Then, I tried to run a test with a single human sample, using Gencode v36. From the example in the wiki, it seems like we need a BAM file sorted by reference (chromosomes) and then by read names. This is difficult, since standard tools usually sort either by reference and position, or by read name. I asked this question on Biostars, and a tool was recommended. Another option would be to split the BAM by chromosome with bamtools, sort each resulting BAM by read name, and finally merge all BAMs. I tried using sam-xlate with a BAM only sorted by read names, and the job could not finish after 24 hours. When I sort by reference, and then by read names, it goes much faster. I wasn't able to provide a transcripts fasta for the I was able to run sam-xlate to convert a BAM generated with STAR. Then I estimated expression levels with Salmon for the sam-xlate BAM, and also for a BAM generated directly by STAR in transcriptome coordinates. When I compare the transcript-level estimates from the different methods, I see substantial differences. To begin with, the sam-xlate BAM is only 40% the size of the BAM generated by STAR in transcriptome coordinates. For transcripts with TPM = 0 in sam-xlate, 33k had TPM > 0 with STAR direct conversion (and 12k TPM > 1). On the other hand, for transcripts with TPM = 0 with STAR direct conversion, 6k had TPM > 0 with sam-xlate (and 2k TPM > 1). See figure below (similar picture for gene-level estimates). I will revise my conversion script, and further investigate the reasons for such differences. |
Hello,
i have tried:
I am getting following error:
My java version:
It seems that converting the gencode GTF with awk/bedops:
does not produce BED format resembling one provided with ubu:
Any ideas how to convert gencode.v31 for hg38 into ubu-compatible BED?
Many thanks for your help in advance.
Darek Kedra
The text was updated successfully, but these errors were encountered: