About gap extent penalty #31

angeload · 2017-01-27T18:20:37Z

Hi Jeff,

Thank you for the development of Parasail. It is an outstanding piece of software.

It seems that Parasail is not applying a gap extend penalty when we open gaps in the sequence.
My work requires that both, gap open and gap extend penalties, be applied in order to calculate the final score. Is it possible to include this feature in Parasail?

Best regards,

Angelo Duarte

jeffdaily · 2017-01-27T19:17:37Z

I'm curious if we can emulate open+extend using the interface we have currently.

Let's say you want a gap open penalty of 10 and extension of 1. If you wanted parasail to act as though open+ext is applied, what would happen if you instead set the penalties to (10+1)=11 and 1? Would that act the same way?

angeload · 2017-01-27T20:37:31Z

Thanks for your prompt answer.

I will try your suggestion and check if the results match. Meanwhile, please consider our help in implementing the gap extension penalty in Parasail.

angeload · 2017-01-30T13:20:04Z

gotoh.pdf

To use open+extend generates wrong results. The extend penalty (b) should be multiplied by the gap length (k) before adding to the gap opening penalty (a). The final formula for the total penalty is a + b*k.

jeffdaily · 2017-01-30T18:28:44Z

Sorry that didn't work for you. Thanks for trying. I'd like more help understanding how to implement this. Is there a paper or book chapter on this particular approach? Is there another software package I could verify against? For example, it was useful for me to verify parasail against emboss.

…

On Jan 30, 2017, at 7:20 AM, Angelo Duarte ***@***.***> wrote: To use open+extend generates wrong results. The extend penalty (b) should be multiplied by the gap length (k) before adding to the gap opening penalty (a). The final formula for the total penalty is a + b*k. — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

angeload · 2017-01-31T14:15:01Z

The bioinformaticians from the team have compared the results of both, the final alignment string and the obtained scores, using the alignment tool provided by NCBI, which uses the Needleman algorithm with Gotoh (https://blast.ncbi.nlm.nih.gov/Blast .cgi?PAGE_TYPE=BlastSearch&PROG_DEF=blastn&BLAST_PROG_DEF=blastn&BLAST_SPEC=GlobalAln&LINK_LOC=BlastHomeLink). For performance comparison we use the classic package provided with Biolinux and Biojava.

In my last comment, I
gotoh.pdf
attached a PDF file with a summary of the algorithms and some references, but it seems that some problem occurred. I'm attaching the file again now. Please, let me know if you received it.

jeffdaily · 2017-01-31T14:58:19Z

My last reply was via email so the attachment did not come through. It was indeed available via this github issue tracker.

I read through the gotoh.pdf doc. Are you looking for a distance (minimum of positive value) measure as in the Gotoh example or a score (max) as in the SW example?

I followed the NCBI link. Do you have some hopefully short sequences I could use to see what kind of result you're after? My plan would be to run your sample sequences through the interface in the link and get some known, good baseline results. Then I would implement a non-vectorized basic serial implementation of the same algorithm to improve my unerstanding before embarking on the various vectorized versions of the same algorithm.

Of course, I am also happy to accept help.

For some background on parasail, I started with Mengyao's software and worked backwards. I templated the various vector intrinsics, added AVX2 support, added in-place statistics, etc. Eventually it didn't really resemble the original work. So the approach of not adding the gap extend penalty to the gap open penalty was carried over. I then implemented semi-global and global alignments by changing the initial boundary conditions as well as which cell in the DP table represented the final score. Again, the gap open and extend feature carried over.

I'm hoping we can finally work towards the solution you're after, and hopefully it's a feature that others would find useful, too.

angeload · 2017-02-01T14:55:58Z

We are studying the best way to explain the problem to you. Since our team has computer scientists and bioinformaticians, sometimes we have to synchronize the ideas. I'll contact you soon. Please keep the comment open.

jeffdaily · 2017-02-01T15:38:17Z

I'll definitely keep the comment open. Full disclosure, I'm "only" a computer scientist and learned just enough bio to write parasail.

angeload · 2017-02-01T18:39:14Z

Do not worry. We are on the same side.

angeload · 2017-02-08T15:38:22Z

We read the Parasail nw.c source code and found that the difference in the results is caused by the way Parasail calculates the gap penalty and does the matrix initialization. This does not mean that Parasail is wrong. It is just different the way the bioinformaticians do the calculus. Please, also consider that we could have misunderstood your code.

According to our bioinformatics guys, we should penalise a gap opening also considering that it is an extension. Therefore, the first element of matrix H (line 58 in nw.c) should be not 0 but the value of gap open penalty. Also, in the loop in line 62, we should use 'j' instead of (j-1). For example, for an open penalty of 5 and extend penalty of 2, H[] should be -5, -7, -9, ...

We also think that vector NH (line 71) should be initialized with gap open penalty and WH (line 72) should use 'i' instead (i-1).

We are implementing this simple changing in your code for comparing the results with our approach. We hope to update you soon. Meanwhile, we are trying to better understand the meanings of the variables in the nw.c code. Could you explain what role they have in the code?

jeffdaily · 2017-02-08T19:26:59Z

I really appreciate you digging into the code. This should be fun.

The nw.c source file is my non-vectorized (i.e., serial) implementation of global alignment. The rest of the vectorized global alignment implementations are self-consistent with nw.c, so if I got nw.c wrong then the rest are wrong. I hope not. But then again, I only ever personally used local and semi-global; I implemented global alignment for the sake of the larger community.

Though you've probably noticed by now, note that nw.c computes row-wise. Here's a summary of the variables within the nw.c source file:

s1[i] -- character from sequence 1 (query), arranged as columnb along left side of DP table
s2[j] -- character from sequence 2 (database), arranged as row along top of DP table
H[j] -- final score row, max of F[i], current E, and H_dag temporary variable
F[j] -- score row within deletion table, i.e., "up" or "north" or "N"
E -- score within insertion table, i.e., "left" or "west" or "W"
NH -- "north H"
WH -- "west H"
NWH -- "north west H"
H_dag -- "H diagonal", temporary variable holding contribution from match/mismatch
H_new -- temporary variable holding next final H value
E_opn -- score when opening gap in insertion table
E_ext -- score when extending gap in insertion table
F_opn -- score when opening gap in deletion table
F_ext -- score when extending gap in deletion table

The primary differences between global, semi-global, and local alignment implementations are the boundary conditions and final score locations. I'll try to illustrate the boundary conditions.

global alignments

|       |               | s2[0] | s2[1]         | s2[2]         | s2[3]         |
|       | 0             | -open | -open - 1*ext | -open - 2*ext | -open - 3*ext |
| s1[0] | -open         | <>    | <>            | <>            | <>            |
| s1[1] | -open - 1*ext | <>    | <>            | <>            | <>            |
| s1[2] | -open - 2*ext | <>    | <>            | <>            | <>            |
| s1[3] | -open - 3*ext | <>    | <>            | <>            | <>            |
| s1[4] | -open - 4*ext | <>    | <>            | <>            | <>            |

local and semi-global alignments

|       |   | s2[0] | s2[1] | s2[2] | s2[3] |
|       | 0 | 0     | 0     | 0     | 0     |
| s1[0] | 0 | <>    | <>    | <>    | <>    |
| s1[1] | 0 | <>    | <>    | <>    | <>    |
| s1[2] | 0 | <>    | <>    | <>    | <>    |
| s1[3] | 0 | <>    | <>    | <>    | <>    |
| s1[4] | 0 | <>    | <>    | <>    | <>    |

As noted in the main README.md: "When any of the algorithms open a gap, only the gap open penalty alone is applied." Again, this is a carry-over from when I was understanding Mengyao's implementation. Further, comparing results against EMBOSS showed I was getting the same scores.

I want to get this correct. I don't know the bioinformatics domain well enough to make a value judgement concerning which approach is correct. Is one or the other approach (-open-gap vs. just -open) more biologically relevant or is it a preference of the researcher? I'm just trying to understand.

I hope that you're able to make your changes to the nw.c code and validate it against your own methods. Then we can figure out what to do from there.

jeffdaily · 2017-02-24T18:48:37Z

Which parasail function are you using in this evaluation? I'm assuming it's one of the local alignment functions?

iagorios · 2017-03-02T12:10:51Z

I'm using nw_striped_16 from Aligner (with Java extension)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

About gap extent penalty #31

About gap extent penalty #31

angeload commented Jan 27, 2017

jeffdaily commented Jan 27, 2017

angeload commented Jan 27, 2017

angeload commented Jan 30, 2017 •

edited

Loading

jeffdaily commented Jan 30, 2017 via email

angeload commented Jan 31, 2017

jeffdaily commented Jan 31, 2017

angeload commented Feb 1, 2017

jeffdaily commented Feb 1, 2017

angeload commented Feb 1, 2017

angeload commented Feb 8, 2017

jeffdaily commented Feb 8, 2017 •

edited

Loading

jeffdaily commented Feb 24, 2017

iagorios commented Mar 2, 2017

About gap extent penalty #31

About gap extent penalty #31

Comments

angeload commented Jan 27, 2017

jeffdaily commented Jan 27, 2017

angeload commented Jan 27, 2017

angeload commented Jan 30, 2017 • edited Loading

jeffdaily commented Jan 30, 2017 via email

angeload commented Jan 31, 2017

jeffdaily commented Jan 31, 2017

angeload commented Feb 1, 2017

jeffdaily commented Feb 1, 2017

angeload commented Feb 1, 2017

angeload commented Feb 8, 2017

jeffdaily commented Feb 8, 2017 • edited Loading

global alignments

local and semi-global alignments

jeffdaily commented Feb 24, 2017

iagorios commented Mar 2, 2017

angeload commented Jan 30, 2017 •

edited

Loading

jeffdaily commented Feb 8, 2017 •

edited

Loading