Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

About gap extent penalty #31

Open
angeload opened this issue Jan 27, 2017 · 13 comments
Open

About gap extent penalty #31

angeload opened this issue Jan 27, 2017 · 13 comments

Comments

@angeload
Copy link

Hi Jeff,

Thank you for the development of Parasail. It is an outstanding piece of software.

It seems that Parasail is not applying a gap extend penalty when we open gaps in the sequence.
My work requires that both, gap open and gap extend penalties, be applied in order to calculate the final score. Is it possible to include this feature in Parasail?

Best regards,

Angelo Duarte

@jeffdaily
Copy link
Owner

I'm curious if we can emulate open+extend using the interface we have currently.

Let's say you want a gap open penalty of 10 and extension of 1. If you wanted parasail to act as though open+ext is applied, what would happen if you instead set the penalties to (10+1)=11 and 1? Would that act the same way?

@angeload
Copy link
Author

Thanks for your prompt answer.

I will try your suggestion and check if the results match. Meanwhile, please consider our help in implementing the gap extension penalty in Parasail.

@angeload
Copy link
Author

angeload commented Jan 30, 2017

gotoh.pdf

To use open+extend generates wrong results. The extend penalty (b) should be multiplied by the gap length (k) before adding to the gap opening penalty (a). The final formula for the total penalty is a + b*k.

@jeffdaily
Copy link
Owner

jeffdaily commented Jan 30, 2017 via email

@angeload
Copy link
Author

The bioinformaticians from the team have compared the results of both, the final alignment string and the obtained scores, using the alignment tool provided by NCBI, which uses the Needleman algorithm with Gotoh (https://blast.ncbi.nlm.nih.gov/Blast .cgi?PAGE_TYPE=BlastSearch&PROG_DEF=blastn&BLAST_PROG_DEF=blastn&BLAST_SPEC=GlobalAln&LINK_LOC=BlastHomeLink). For performance comparison we use the classic package provided with Biolinux and Biojava.

In my last comment, I
gotoh.pdf
attached a PDF file with a summary of the algorithms and some references, but it seems that some problem occurred. I'm attaching the file again now. Please, let me know if you received it.

@jeffdaily
Copy link
Owner

My last reply was via email so the attachment did not come through. It was indeed available via this github issue tracker.

I read through the gotoh.pdf doc. Are you looking for a distance (minimum of positive value) measure as in the Gotoh example or a score (max) as in the SW example?

I followed the NCBI link. Do you have some hopefully short sequences I could use to see what kind of result you're after? My plan would be to run your sample sequences through the interface in the link and get some known, good baseline results. Then I would implement a non-vectorized basic serial implementation of the same algorithm to improve my unerstanding before embarking on the various vectorized versions of the same algorithm.

Of course, I am also happy to accept help.

For some background on parasail, I started with Mengyao's software and worked backwards. I templated the various vector intrinsics, added AVX2 support, added in-place statistics, etc. Eventually it didn't really resemble the original work. So the approach of not adding the gap extend penalty to the gap open penalty was carried over. I then implemented semi-global and global alignments by changing the initial boundary conditions as well as which cell in the DP table represented the final score. Again, the gap open and extend feature carried over.

I'm hoping we can finally work towards the solution you're after, and hopefully it's a feature that others would find useful, too.

@angeload
Copy link
Author

angeload commented Feb 1, 2017

We are studying the best way to explain the problem to you. Since our team has computer scientists and bioinformaticians, sometimes we have to synchronize the ideas. I'll contact you soon. Please keep the comment open.

@jeffdaily
Copy link
Owner

I'll definitely keep the comment open. Full disclosure, I'm "only" a computer scientist and learned just enough bio to write parasail.

@angeload
Copy link
Author

angeload commented Feb 1, 2017

Do not worry. We are on the same side.

@angeload
Copy link
Author

angeload commented Feb 8, 2017

We read the Parasail nw.c source code and found that the difference in the results is caused by the way Parasail calculates the gap penalty and does the matrix initialization. This does not mean that Parasail is wrong. It is just different the way the bioinformaticians do the calculus. Please, also consider that we could have misunderstood your code.

According to our bioinformatics guys, we should penalise a gap opening also considering that it is an extension. Therefore, the first element of matrix H (line 58 in nw.c) should be not 0 but the value of gap open penalty. Also, in the loop in line 62, we should use 'j' instead of (j-1). For example, for an open penalty of 5 and extend penalty of 2, H[] should be -5, -7, -9, ...

We also think that vector NH (line 71) should be initialized with gap open penalty and WH (line 72) should use 'i' instead (i-1).

We are implementing this simple changing in your code for comparing the results with our approach. We hope to update you soon. Meanwhile, we are trying to better understand the meanings of the variables in the nw.c code. Could you explain what role they have in the code?

@jeffdaily
Copy link
Owner

jeffdaily commented Feb 8, 2017

I really appreciate you digging into the code. This should be fun.

The nw.c source file is my non-vectorized (i.e., serial) implementation of global alignment. The rest of the vectorized global alignment implementations are self-consistent with nw.c, so if I got nw.c wrong then the rest are wrong. I hope not. But then again, I only ever personally used local and semi-global; I implemented global alignment for the sake of the larger community.

Though you've probably noticed by now, note that nw.c computes row-wise. Here's a summary of the variables within the nw.c source file:

  • s1[i] -- character from sequence 1 (query), arranged as columnb along left side of DP table
  • s2[j] -- character from sequence 2 (database), arranged as row along top of DP table
  • H[j] -- final score row, max of F[i], current E, and H_dag temporary variable
  • F[j] -- score row within deletion table, i.e., "up" or "north" or "N"
  • E -- score within insertion table, i.e., "left" or "west" or "W"
  • NH -- "north H"
  • WH -- "west H"
  • NWH -- "north west H"
  • H_dag -- "H diagonal", temporary variable holding contribution from match/mismatch
  • H_new -- temporary variable holding next final H value
  • E_opn -- score when opening gap in insertion table
  • E_ext -- score when extending gap in insertion table
  • F_opn -- score when opening gap in deletion table
  • F_ext -- score when extending gap in deletion table

The primary differences between global, semi-global, and local alignment implementations are the boundary conditions and final score locations. I'll try to illustrate the boundary conditions.

global alignments

|       |               | s2[0] | s2[1]         | s2[2]         | s2[3]         |
|       | 0             | -open | -open - 1*ext | -open - 2*ext | -open - 3*ext |
| s1[0] | -open         | <>    | <>            | <>            | <>            |
| s1[1] | -open - 1*ext | <>    | <>            | <>            | <>            |
| s1[2] | -open - 2*ext | <>    | <>            | <>            | <>            |
| s1[3] | -open - 3*ext | <>    | <>            | <>            | <>            |
| s1[4] | -open - 4*ext | <>    | <>            | <>            | <>            |

local and semi-global alignments

|       |   | s2[0] | s2[1] | s2[2] | s2[3] |
|       | 0 | 0     | 0     | 0     | 0     |
| s1[0] | 0 | <>    | <>    | <>    | <>    |
| s1[1] | 0 | <>    | <>    | <>    | <>    |
| s1[2] | 0 | <>    | <>    | <>    | <>    |
| s1[3] | 0 | <>    | <>    | <>    | <>    |
| s1[4] | 0 | <>    | <>    | <>    | <>    |

As noted in the main README.md: "When any of the algorithms open a gap, only the gap open penalty alone is applied." Again, this is a carry-over from when I was understanding Mengyao's implementation. Further, comparing results against EMBOSS showed I was getting the same scores.

I want to get this correct. I don't know the bioinformatics domain well enough to make a value judgement concerning which approach is correct. Is one or the other approach (-open-gap vs. just -open) more biologically relevant or is it a preference of the researcher? I'm just trying to understand.

I hope that you're able to make your changes to the nw.c code and validate it against your own methods. Then we can figure out what to do from there.

@jeffdaily
Copy link
Owner

Which parasail function are you using in this evaluation? I'm assuming it's one of the local alignment functions?

@iagorios
Copy link

iagorios commented Mar 2, 2017

I'm using nw_striped_16 from Aligner (with Java extension)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants