Skip to content

Sequence Alignment (Needleman–Wunsch Algorithm using Dynamic Programming) for aligning sequences (words, sentences, DNA etc.)

Notifications You must be signed in to change notification settings

bolds07/sequence-alignment

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

26 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Sequence Alignment

Implementation of the classic Dynamic Programming problem using the Needleman–Wunsch algorithm which requires quadratic space & time complexity.

Problem Statement

Given 2 sequences, find the minimum cost of aligning the 2 sequences
Gaps can be inserted to 1 sequence or the other, but not at the same time
2 = Gap Penalty (δ)
If 2 characters are aligned with each other, there may be a mismatch penalty (αi j)

  • 0 = aligning identical letters
  • 1 = aligning a vowel with a vowel, or a consonant with a consonant
  • 3 = aligning a vowel with a consonant

Minimum cost = sum of mismatch & gap penalties (the optimal alignment)

Optimal Substructure

Runtime

O(M*N)
Storage: also O(M*N) which requires a large 2D array

Pseudocode

Usage

Requirements & Caveats

  • Letters only (no spaces, numbers or special characters)
  • _ (Underscore character) is reserved to represent
  • View results in a fixed-width font for the 2 sequences to be lined up
  • Leading & trailing whitespace is trimmed
  • Uppercase / Lowercase is ignored (all converted to lowercase)
  • predecessorIndexes is calculated when creating memoTable but not used to find the actual alignment, only to show where the values in memoTable came from
  • For example: if predecessorIndexes[4][4] contained the array [4, 3], it means the value of memoTable[4][4] (which is 6) came from memoTable[4][3] (i.e. case3, seq2 with a gap so it came from the left)
  • predecessorIndexes[0][0] contains the array [-1, -1] because the upper left corner has no predecessor, it was the initial base case

Setup

  • Provide 2 strings (edit testSequences 2D array)
  • Optionally change the GAP_PENALTY and the mismatch penalties
    Currently VOWEL_VOWEL_PENALTY and CONSONANT_CONSONANT_PENALTY are the same, but are arbitrary

Example: aligning "mean" with "name"

Code Notes

  • findAlignment() should only be called from inside calcOptimalAlignment() because it uses memoTable & the 2 sequences to display the actual alignment
  • seq1 & seq2 have a leading space added (this is after trimming any trailing/leading whitespace)
    This is so that seq1.charAt(i) & seq2.charAt(j) work in the loops & the string indexes match up
    It causes some adjustments for the array sizes.
  • memoTable = new int[seq1.length()][seq2.length()] uses the exact value of length() which includes the space
  • Loops like for(int i=0; i<seq1.length(); i++) use <seq1.length() to not go out of bounds of the string indexes
  • In findAlignment(), int i = seq1.length()-1; & int j = seq2.length()-1; since the 2 sequences have leading spaces & arrays are indexes from 0
  • findAlignment() basically retraces each calculation in the memoTable to see where it came from
    There may be multiple optimal paths, but this only finds 1 of them

References

About

Sequence Alignment (Needleman–Wunsch Algorithm using Dynamic Programming) for aligning sequences (words, sentences, DNA etc.)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Java 100.0%