Algorithms for Beginners: Dynamic Programming (Part 1)

Dynamic Programming

The following is an example of global sequence alignment using the Needleman-Wunsch algorithm. For this example, the two sequences to be globally aligned are

G A A T T C A G T T A (sequence #1) 
G G A T C G A (sequence #2)

So M = 11 and N = 7 (the lengths of sequence #1 and sequence #2, respectively).

A simple scoring scheme is assumed where

  • S[i,j] = 1 if the residue at position i of sequence #1 is the same as the residue at position j of sequence #2 (match score); otherwise
  • S[i,j] = 0 (mismatch score)
  • w = 0 (gap penalty)

Three steps in dynamic programming

  1. Initialization
  2. Matrix fill (scoring)
  3. Traceback (alignment)

Initialization Step

The first step in the global alignment dynamic programming approach is to create a matrix with M + 1 columns and N + 1 rows, where M and N are the lengths of the sequences to be aligned.

Since this example assumes there is no gap opening or gap extension penalty, the first row and first column of the matrix can be initially filled with 0.
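
As a minimal sketch of this initialization step, assuming Python and a dictionary keyed by (i, j) so that M[i, j] in code corresponds to position i,j in the text (i counts columns along sequence #1, j counts rows along sequence #2). The names init_matrix, SEQ1 and SEQ2 are illustrative and do not come from the original tutorial.

```python
SEQ1 = "GAATTCAGTTA"  # sequence #1, indexed by column i
SEQ2 = "GGATCGA"      # sequence #2, indexed by row j


def init_matrix(seq1, seq2):
    """Return a dict holding only the zero-filled first row and first column."""
    m, n = len(seq1), len(seq2)
    M = {}
    for i in range(m + 1):  # first row: M + 1 columns
        M[i, 0] = 0
    for j in range(n + 1):  # first column: N + 1 rows
        M[0, j] = 0
    return M


M = init_matrix(SEQ1, SEQ2)
print(len(M))  # 19 border cells: 12 in the first row plus 8 in the first column, sharing M[0, 0]
```

Only the border is stored at this point; the interior cells are added during the matrix fill step.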

Matrix Fill Step

One possible (inefficient) solution of the matrix fill step finds the maximum global alignment score by starting in the upper left-hand corner of the matrix and finding the maximal score M[i,j] for each position in the matrix. To find M[i,j] for any i,j, it is sufficient to know the scores for the matrix positions to the left of, above and diagonal to i,j. In terms of matrix positions, it is necessary to know M[i-1,j], M[i,j-1] and M[i-1,j-1].

For each position, M[i,j] is defined to be the maximum score at position i,j; i.e.

M[i,j] = MAXIMUM[
     M[i-1,j-1] + S[i,j] (match/mismatch in the diagonal),
     M[i,j-1] + w (gap in sequence #1),
     M[i-1,j] + w (gap in sequence #2)]

Note that in the example, M[i-1,j-1] will be red, M[i,j-1] will be green and M[i-1,j] will be blue.

Using this information, the score at position 1,1 in the matrix can be calculated. Since the first residue in both sequences is a G, S[1,1] = 1, and by the assumptions stated at the beginning, w = 0. Thus, M[1,1] = MAX[M[0,0] + 1, M[1,0] + 0, M[0,1] + 0] = MAX[1, 0, 0] = 1.

A value of 1 is then placed in position 1,1 of the scoring matrix.
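
The calculation for position 1,1 can be reproduced with a small sketch of the recurrence for a single cell. The function name cell_score and its keyword parameters are assumptions made for illustration; the defaults follow the simple scoring scheme (match 1, mismatch 0, gap 0).

```python
def cell_score(M, seq1, seq2, i, j, match=1, mismatch=0, gap=0):
    """Apply the recurrence to one cell; M must already hold its three neighbors."""
    s = match if seq1[i - 1] == seq2[j - 1] else mismatch  # S[i,j]
    return max(
        M[i - 1, j - 1] + s,  # diagonal: match/mismatch
        M[i, j - 1] + gap,    # cell above: gap in sequence #1
        M[i - 1, j] + gap,    # cell to the left: gap in sequence #2
    )


seq1, seq2 = "GAATTCAGTTA", "GGATCGA"
M = {(0, 0): 0, (1, 0): 0, (0, 1): 0}  # the three initialized neighbors of 1,1
M[1, 1] = cell_score(M, seq1, seq2, 1, 1)
print(M[1, 1])  # 1, from MAX[1, 0, 0]
```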

Since the gap penalty (w) is 0, the rest of row 1 and column 1 can be filled in with the value 1. Take row 1 as an example. At column 2, the value is the maximum of 0 (mismatch), 0 (vertical gap) or 1 (horizontal gap). The rest of row 1 can be filled in similarly until we get to column 8. At this point, there is a G in both sequences. Thus, the value for the cell at row 1, column 8 is the maximum of 1 (match), 0 (vertical gap) or 1 (horizontal gap). The value will again be 1. The rest of row 1 and column 1 can be filled with 1 using the same reasoning.

Now let's look at column 2. The location at row 2 will be assigned the maximum of 1 (mismatch), 1 (horizontal gap) or 1 (vertical gap). So its value is 1.

At the position column 2, row 3, there is an A in both sequences. Thus, its value will be the maximum of 2 (match), 1 (horizontal gap) or 1 (vertical gap), so its value is 2.

Moving along to position column 2, row 4, its value will be the maximum of 1 (mismatch), 1 (horizontal gap) or 2 (vertical gap), so its value is 2. Note that for all of the remaining positions in column 2 except the last one, the choices for the value will be exactly the same as in row 4, since there are no matches. The final row will contain the value 2, since it is the maximum of 2 (match), 1 (horizontal gap) and 2 (vertical gap).
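
The row 1 and column 2 values worked out above can be checked with a short sketch of the whole matrix fill step. The function name fill and the dictionary layout are assumptions for illustration, not part of the original tutorial.

```python
def fill(seq1, seq2, match=1, mismatch=0, gap=0):
    """Fill M[i, j] (i: column along seq1, j: row along seq2) from a zero border."""
    m, n = len(seq1), len(seq2)
    M = {(i, 0): 0 for i in range(m + 1)}
    M.update({(0, j): 0 for j in range(n + 1)})
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            M[i, j] = max(M[i - 1, j - 1] + s, M[i, j - 1] + gap, M[i - 1, j] + gap)
    return M


seq1, seq2 = "GAATTCAGTTA", "GGATCGA"
M = fill(seq1, seq2)
row1 = [M[i, 1] for i in range(1, len(seq1) + 1)]
col2 = [M[2, j] for j in range(1, len(seq2) + 1)]
print(row1)  # eleven 1s
print(col2)  # [1, 1, 2, 2, 2, 2, 2]

# Print the whole matrix, one row of sequence #2 per line.
for j in range(len(seq2) + 1):
    print(" ".join(str(M[i, j]) for i in range(len(seq1) + 1)))
```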

Using the same techniques as described for column 2, we can fill in column 3.

After all of the values are filled in, the score matrix is complete.

Traceback Step

After the matrix fill step, the maximum alignment score for the two test sequences is 6. The traceback step determines the actual alignment(s) that result in the maximum score. Note that with a simple scoring scheme such as the one used here, there are likely to be multiple maximal alignments.

The traceback step begins in the M,N position of the matrix, i.e. the position that holds the maximal score. In this case, there is a 6 in that location.

Traceback takes the current cell and looks to the neighbor cells that could be direct predecessors. This means it looks to the neighbor to the left (gap in sequence #2), the diagonal neighbor (match/mismatch), and the neighbor above it (gap in sequence #1). The traceback algorithm chooses one of the possible predecessors as the next cell in the sequence. In this case, all three neighbors are equal to 5.

Since the current cell has a value of 6 and the scores are 1 for a match and 0 for anything else, the only possible predecessor is the diagonal match/mismatch neighbor. If more than one possible predecessor exists, any can be chosen. This gives us a current alignment of

    (Seq #1)      A 
                  |
    (Seq #2)      A
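
The predecessor test used in this first traceback step can be sketched as follows. The helper names fill and predecessors are hypothetical; a neighbor counts as a possible predecessor when its value plus the step score reproduces the value of the current cell.

```python
def fill(seq1, seq2, match=1, mismatch=0, gap=0):
    """Fill M[i, j] (i: column along seq1, j: row along seq2) from a zero border."""
    m, n = len(seq1), len(seq2)
    M = {(i, 0): 0 for i in range(m + 1)}
    M.update({(0, j): 0 for j in range(n + 1)})
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            M[i, j] = max(M[i - 1, j - 1] + s, M[i, j - 1] + gap, M[i - 1, j] + gap)
    return M


def predecessors(M, seq1, seq2, i, j, match=1, mismatch=0, gap=0):
    """Names of the neighbors that can explain the value of M[i, j]."""
    s = match if seq1[i - 1] == seq2[j - 1] else mismatch
    found = []
    if M[i - 1, j - 1] + s == M[i, j]:
        found.append("diagonal")  # match/mismatch
    if M[i - 1, j] + gap == M[i, j]:
        found.append("left")      # gap in sequence #2
    if M[i, j - 1] + gap == M[i, j]:
        found.append("up")        # gap in sequence #1
    return found


seq1, seq2 = "GAATTCAGTTA", "GGATCGA"
M = fill(seq1, seq2)
print(predecessors(M, seq1, seq2, 11, 7))  # ['diagonal']
```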

We now look at the current cell and determine which cell is its direct predecessor. In this case, it is the neighbor to the left, which also contains a 5.

This step adds a gap to sequence #2, so the current alignment is

    (Seq #1)     T A
                   |
    (Seq #2)     _ A

Once again, the direct predecessor produces a gap in sequence #2. After this step, the current alignment is

    (Seq #1)     T T A
                     |
    (Seq #2)     _ _ A

Continuing with the traceback step, we eventually reach the position at column 0, row 0, which tells us that traceback is complete. One possible maximal alignment is:

          G A A T T C A G T T A
          |   |   | |   |     | 
          G G A _ T C _ G _ _ A
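
A complete traceback for the simple scoring scheme might look like the sketch below. The tie-breaking order (diagonal first, then left, then up) is an arbitrary assumption, so the alignment it prints is one maximal alignment and need not be the one shown above. The names fill and traceback are illustrative.

```python
def fill(seq1, seq2, match=1, mismatch=0, gap=0):
    """Fill M[i, j] (i: column along seq1, j: row along seq2) from a zero border."""
    m, n = len(seq1), len(seq2)
    M = {(i, 0): 0 for i in range(m + 1)}
    M.update({(0, j): 0 for j in range(n + 1)})
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            M[i, j] = max(M[i - 1, j - 1] + s, M[i, j - 1] + gap, M[i - 1, j] + gap)
    return M


def traceback(M, seq1, seq2, match=1, mismatch=0, gap=0):
    """Walk from M[M, N] back to M[0, 0], choosing one predecessor per cell."""
    i, j = len(seq1), len(seq2)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if M[i - 1, j - 1] + s == M[i, j]:      # diagonal neighbor
                top.append(seq1[i - 1])
                bottom.append(seq2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and (j == 0 or M[i - 1, j] + gap == M[i, j]):
            top.append(seq1[i - 1])                  # left neighbor: gap in sequence #2
            bottom.append("_")
            i -= 1
        else:
            top.append("_")                          # neighbor above: gap in sequence #1
            bottom.append(seq2[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


seq1, seq2 = "GAATTCAGTTA", "GGATCGA"
top, bottom = traceback(fill(seq1, seq2), seq1, seq2)
matches = sum(a == b for a, b in zip(top, bottom))
print(top)
print(bottom)
print(matches)  # 6 matching columns, equal to the maximal score
```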

An alternative solution is:

          G _ A A T T C A G T T A
          |     |   | |   |     | 
          G G _ A _ T C _ G _ _ A

There are more alternative solutions, each resulting in a maximal global alignment score of 6. Since the number of maximal alignments can grow exponentially with sequence length, most dynamic programming implementations print out only a single solution.
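
To get a feel for how many alternatives exist without listing them, the number of distinct traceback paths can be counted with a second table filled alongside the scores. This is a sketch under the simple scoring scheme; the name count_tracebacks is hypothetical, and paths that differ only in the order of adjacent gaps are counted separately.

```python
def count_tracebacks(seq1, seq2, match=1, mismatch=0, gap=0):
    """Count the traceback paths from M[M, N] back to the zero border."""
    m, n = len(seq1), len(seq2)
    M = {(i, 0): 0 for i in range(m + 1)}
    M.update({(0, j): 0 for j in range(n + 1)})
    # Each border cell has exactly one way back to M[0, 0]: straight along the border.
    ways = {cell: 1 for cell in M}
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            M[i, j] = max(M[i - 1, j - 1] + s, M[i, j - 1] + gap, M[i - 1, j] + gap)
            total = 0
            if M[i - 1, j - 1] + s == M[i, j]:
                total += ways[i - 1, j - 1]
            if M[i - 1, j] + gap == M[i, j]:
                total += ways[i - 1, j]
            if M[i, j - 1] + gap == M[i, j]:
                total += ways[i, j - 1]
            ways[i, j] = total
    return ways[m, n]


print(count_tracebacks("A", "A"))  # 1: only the diagonal match reaches the score of 1
print(count_tracebacks("A", "T"))  # 3: mismatch, or a gap in either sequence
n_paths = count_tracebacks("GAATTCAGTTA", "GGATCGA")
print(n_paths)
```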

Advanced Dynamic Programming Tutorial

If you have not yet looked at an example with a simple scoring scheme, please read the simple dynamic programming example above first.

The following is an example of global sequence alignment using the Needleman-Wunsch algorithm. For this example, the two sequences to be globally aligned are

G A A T T C A G T T A (sequence #1) 
G G A T C G A (sequence #2)

So M = 11 and N = 7 (the lengths of sequence #1 and sequence #2, respectively).

An advanced scoring scheme is assumed where

  • S[i,j] = 2 if the residue at position i of sequence #1 is the same as the residue at position j of sequence #2 (match score); otherwise
  • S[i,j] = -1 (mismatch score)
  • w = -2 (gap penalty)

Initialization Step

The first step in the global alignment dynamic programming approach is to create a matrix with M + 1 columns and N + 1 rows, where M and N are the lengths of the sequences to be aligned.

The first row and first column of the matrix can be initially filled with 0.

Matrix Fill Step

One possible (inefficient) solution of the matrix fill step finds the maximum global alignment score by starting in the upper left-hand corner of the matrix and finding the maximal score M[i,j] for each position in the matrix. To find M[i,j] for any i,j, it is sufficient to know the scores for the matrix positions to the left of, above and diagonal to i,j. In terms of matrix positions, it is necessary to know M[i-1,j], M[i,j-1] and M[i-1,j-1].

For each position, M[i,j] is defined to be the maximum score at position i,j; i.e.

M[i,j] = MAXIMUM[
     M[i-1,j-1] + S[i,j] (match/mismatch in the diagonal),
     M[i,j-1] + w (gap in sequence #1),
     M[i-1,j] + w (gap in sequence #2)]

Note that in the example, M[i-1,j-1] will be red, M[i,j-1] will be green and M[i-1,j] will be blue.

Using this information, the score at position 1,1 in the matrix can be calculated. Since the first residue in both sequences is a G, S[1,1] = 2, and by the assumptions stated earlier, w = -2. Thus, M[1,1] = MAX[M[0,0] + 2, M[1,0] - 2, M[0,1] - 2] = MAX[2, -2, -2] = 2.

A value of 2 is then placed in position 1,1 of the scoring matrix. Note that an arrow is also placed pointing back to the cell that produced the maximum score, M[0,0].

Moving down the first column to row 2, we can see that there is once again a match in both sequences. Thus, S[1,2] = 2. So M[1,2] = MAX[M[0,1] + 2, M[1,1] - 2, M[0,2] - 2] = MAX[0 + 2, 2 - 2, 0 - 2] = MAX[2, 0, -2] = 2.

A value of 2 is then placed in position 1,2 of the scoring matrix and an arrow is placed to point back to M[0,1] which led to the maximum score.

Looking at column 1, row 3, there is not a match in the sequences, so S[1,3] = -1. M[1,3] = MAX[M[0,2] - 1, M[1,2] - 2, M[0,3] - 2] = MAX[0 - 1, 2 - 2, 0 - 2] = MAX[-1, 0, -2] = 0.

A value of 0 is then placed in position 1,3 of the scoring matrix and an arrow is placed to point back to M[1,2] which led to the maximum score.

We can continue filling in the cells of the scoring matrix using the same reasoning.

Eventually, we get to column 3, row 2. Since there is not a match in the sequences at this position, S[3,2] = -1. M[3,2] = MAX[M[2,1] - 1, M[3,1] - 2, M[2,2] - 2] = MAX[0 - 1, -1 - 2, 1 - 2] = MAX[-1, -3, -1] = -1.

Note that in the above case, there are two different ways to get the maximum score. In such a case, pointers are placed back to all of the cells that can produce the maximum score.
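
The matrix fill step with pointers can be sketched as below, under the advanced scoring scheme and with the first row and column kept at 0 as in the text. The function name fill_with_pointers and the pointer labels "diagonal", "up" and "left" are assumptions made for illustration; each cell stores every neighbor that produces its maximum.

```python
def fill_with_pointers(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Return the score table M and, for each interior cell, its list of pointers."""
    m, n = len(seq1), len(seq2)
    M = {(i, 0): 0 for i in range(m + 1)}
    M.update({(0, j): 0 for j in range(n + 1)})
    P = {}
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            options = {
                "diagonal": M[i - 1, j - 1] + s,  # match/mismatch
                "up": M[i, j - 1] + gap,          # gap in sequence #1
                "left": M[i - 1, j] + gap,        # gap in sequence #2
            }
            M[i, j] = max(options.values())
            P[i, j] = [name for name, value in options.items() if value == M[i, j]]
    return M, P


M, P = fill_with_pointers("GAATTCAGTTA", "GGATCGA")
print(M[1, 1], P[1, 1])  # 2 ['diagonal']
print(M[1, 3], P[1, 3])  # 0 ['up']
print(M[3, 2], P[3, 2])  # -1 ['diagonal', 'left']
```

Keeping a list per cell, rather than a single pointer, is what makes it possible to recover every maximal path later.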

The rest of the score matrix can then be filled in using the same reasoning.

Traceback Step

After the matrix fill step, the maximum global alignment score for the two sequences is 3. The traceback step will determine the actual alignment(s) that result in the maximum score.

The traceback step begins in the M,N position of the matrix, i.e. the position where both sequences are globally aligned.

Since we have kept pointers back to all possible predecessors, the traceback step is simple. At each cell, we look to see where we move next according to the pointers. To begin, the only possible predecessor is the diagonal match.

This gives us an alignment of

    A
    | 
    A

Note that the blue letters and gold arrows in the figures indicate the path leading to the maximum score.

We can continue to follow the path using a single pointer until we reach a cell with two pointers.

The alignment at this point is

    T C A G T T A
    | |   |     | 
    T C _ G _ _ A

Note that there are now two possible neighbors that could result in the current score. In such a case, one of the neighbors is arbitrarily chosen.
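
Following the stored pointers and taking an arbitrary one at each tie can be sketched as follows. Here the arbitrary choice is simply the first pointer in each list; the names fill_with_pointers, traceback and score_alignment are hypothetical, and the alignment printed is one maximal alignment that depends on that choice.

```python
def fill_with_pointers(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Return the score table M and, for each interior cell, its list of pointers."""
    m, n = len(seq1), len(seq2)
    M = {(i, 0): 0 for i in range(m + 1)}
    M.update({(0, j): 0 for j in range(n + 1)})
    P = {}
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            options = {"diagonal": M[i - 1, j - 1] + s,
                       "up": M[i, j - 1] + gap,
                       "left": M[i - 1, j] + gap}
            M[i, j] = max(options.values())
            P[i, j] = [name for name, value in options.items() if value == M[i, j]]
    return M, P


def traceback(P, seq1, seq2):
    """Follow one pointer per cell from M[M, N] back to the border."""
    i, j = len(seq1), len(seq2)
    top, bottom = [], []
    while i > 0 and j > 0:
        move = P[i, j][0]  # arbitrary choice when a cell has several pointers
        if move == "diagonal":
            top.append(seq1[i - 1]); bottom.append(seq2[j - 1]); i -= 1; j -= 1
        elif move == "left":
            top.append(seq1[i - 1]); bottom.append("_"); i -= 1
        else:
            top.append("_"); bottom.append(seq2[j - 1]); j -= 1
    while i > 0:  # pad with gaps if the path reached the first row before column 0
        top.append(seq1[i - 1]); bottom.append("_"); i -= 1
    while j > 0:  # pad with gaps if the path reached the first column before row 0
        top.append("_"); bottom.append(seq2[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


def score_alignment(top, bottom, match=2, mismatch=-1, gap=-2):
    """Score an alignment column by column."""
    return sum(gap if "_" in (a, b) else match if a == b else mismatch
               for a, b in zip(top, bottom))


seq1, seq2 = "GAATTCAGTTA", "GGATCGA"
M, P = fill_with_pointers(seq1, seq2)
top, bottom = traceback(P, seq1, seq2)
print(top)
print(bottom)
print(score_alignment(top, bottom))  # 3, the value in the M,N position
```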


Once the traceback is completed, it can be seen that there are only two possible paths leading to a maximal global alignment.


One possible path gives an alignment of

   G A A T T C A G T T A
   |   |   | |   |     | 
   G G A _ T C _ G _ _ A

The other possible path gives an alignment of

   G A A T T C A G T T A
   |   | |   |   |     |
   G G A T _ C _ G _ _ A
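
Both alignments can be recovered by following every pointer rather than choosing one. The sketch below does this with an explicit stack; the names fill_with_pointers and all_alignments are assumptions for illustration, and the first row and column are kept at 0 as in the text.

```python
def fill_with_pointers(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Return the score table M and, for each interior cell, its list of pointers."""
    m, n = len(seq1), len(seq2)
    M = {(i, 0): 0 for i in range(m + 1)}
    M.update({(0, j): 0 for j in range(n + 1)})
    P = {}
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            options = {"diagonal": M[i - 1, j - 1] + s,
                       "up": M[i, j - 1] + gap,
                       "left": M[i - 1, j] + gap}
            M[i, j] = max(options.values())
            P[i, j] = [name for name, value in options.items() if value == M[i, j]]
    return M, P


def all_alignments(P, seq1, seq2):
    """Enumerate every alignment reachable by following pointers from M[M, N]."""
    results = []
    stack = [(len(seq1), len(seq2), "", "")]
    while stack:
        i, j, top, bottom = stack.pop()
        if i == 0 and j == 0:
            results.append((top, bottom))
        elif j == 0:  # on the first row: only sequence #1 residues remain
            stack.append((i - 1, 0, seq1[i - 1] + top, "_" + bottom))
        elif i == 0:  # on the first column: only sequence #2 residues remain
            stack.append((0, j - 1, "_" + top, seq2[j - 1] + bottom))
        else:
            for move in P[i, j]:
                if move == "diagonal":
                    stack.append((i - 1, j - 1, seq1[i - 1] + top, seq2[j - 1] + bottom))
                elif move == "left":
                    stack.append((i - 1, j, seq1[i - 1] + top, "_" + bottom))
                else:
                    stack.append((i, j - 1, "_" + top, seq2[j - 1] + bottom))
    return results


seq1, seq2 = "GAATTCAGTTA", "GGATCGA"
M, P = fill_with_pointers(seq1, seq2)
alignments = all_alignments(P, seq1, seq2)
for top, bottom in alignments:
    print(top)
    print(bottom)
    print()
```

For long sequences the number of paths can be very large, so an enumeration like this is only practical for small examples.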

Remembering that the scoring scheme is +2 for a match, -1 for a mismatch, and -2 for a gap, both alignments can be tested to make sure that they result in a score of 3.

   G A A T T C A G T T A
   |   |   | |   |     | 
   G G A _ T C _ G _ _ A
 
   + - + - + + - + - - +
   2 1 2 2 2 2 2 2 2 2 2

2 - 1 + 2 - 2 + 2 + 2 - 2 + 2 - 2 - 2 + 2 = 3

   G A A T T C A G T T A
   |   | |   |   |     |
   G G A T _ C _ G _ _ A

   + - + + - + - + - - +
   2 1 2 2 2 2 2 2 2 2 2

2 - 1 + 2 + 2 - 2 + 2 - 2 + 2 - 2 - 2 + 2 = 3

So both of these alignments do indeed result in the maximal alignment score.
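
The same check can be done in a few lines of code. This sketch assumes that gaps are written as "_" as in the alignments above; the function name score_alignment is illustrative.

```python
def score_alignment(top, bottom, match=2, mismatch=-1, gap=-2):
    """Sum the column scores of an alignment written with '_' for gaps."""
    total = 0
    for a, b in zip(top, bottom):
        if a == "_" or b == "_":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


first = score_alignment("GAATTCAGTTA", "GGA_TC_G__A")
second = score_alignment("GAATTCAGTTA", "GGAT_C_G__A")
print(first, second)  # 3 3
```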


 