Bioinformatics and Computational Molecular Biology

Algorithms and Hardware

Hidden Markov Models (HMMs)

 

Hidden Markov Models can be used to generate a statistical representation of a family of similar sequences. Unlike the Smith-Waterman method, the substitution matrices and the gap start and gap continuation penalties can vary with position within the model.

 

Symbols Used in Graphical Representation:

 

Transition Probabilities:

pn : probability of starting a new insertion at position n

qn : probability of continuing an insertion at position n

qnm : probability of starting a new deletion at position n (m = n+1)

rnm : probability of continuing a deletion at position n (m = n+1)

snm : probability of ending an insertion at position n (m = n+1)

unm : probability of ending a deletion at position n (m = n+1)

rn : probability of changing from deletion to insertion at position n

tnm : probability of changing from insertion to deletion at position n (m = n+1)

Note: pnm, snm, and unm are the weighted sum of transition probabilities into the Mx node which each depend on the current string character being compared to the model.

 

Identities:

pn + pnm + qnm = 1 (must leave node Mx)

qn + tnm + snm = 1 (must either stay in node Ix or leave)

rn + rnm + unm = 1 (must leave node Dx)

 

Graphical Representation (for model with 2 characters in sequence most probable):

 

Example: most probable string is QW

Probability of a mutation is taken to be 0.095 total and 0.005 for each of the 19 possible amino acid mutations (19*0.005 = 0.095) at both nodes M1 and M2.

If the current character of the string being compared to the model is Q when entering M1 then there is a match otherwise a mutation.

If the current character of the string being compared to the model is W when entering M2 then there is a match otherwise a mutation.

p01 = p12 = 0.6 (with match probability 0.505 and mutation probability 0.095)

p23 = 0.8 (does not depend on any character value since it ends the string)

s01 = s12 = 0.5 (with match probability 0.405 and mutation probability 0.095)

p23 = 0.6 (does not depend on any character value since it ends the string)

u12 = 0.5 (with match probability 0.405 and mutation probability 0.095)

u23 = 0.9 (does not depend on any character value since it ends the string)

q0 = q1 = q2 = 0.4

p0 = p1 = 22 = 0.2

q01 = q12 = 0.2

t01 = t12 = 0.1

r1 = r2 = 0.1

r12 = 0.4

 

Comparing the String QPW to the Example Model:

There are many possible paths through the model than produce QPW, but the one corresponding to the following alignment generates the highest probability (the dash indicates an insertion in the model, which is the same as a deletion in the comparison string):

Q P W (comparison string)

Q - W (model)

The path taken through the model for this alignment is shown in bold. The node names have been replaced by the character than has been emitted by the model (M1=Q, I1=P, M2=W).

The probability that this alignment fits the model is given by the product of the probabilities of each transition: 0.505 * 0.2 * 0.405 * 0.8 = 0.033

Since computers can do additions much faster than multiplications, the natural log of each transition probability is normally specified rather than the actual transition probability. The natural log of the fit probability is then the sum of the individual natural logs.

ln(p01 with match) = ln(0.505) = -0.683

ln(p1) = ln(0.2) = -1.609

ln(p12 with match) = ln(0.405) = -0.904

ln(p23) = ln(0.8) = -0.223

The natural log of the probability that this alignment fits is therefore: -0.683 + (-1.609) + (-0.904) + (-0.223) = -3.419

When using probabilities the result should be between 0 and 1, with 0 being the least probable and 1 being the most probable. When using logs of probabilities the result should be between minus infinity and 0, with minus infinity being the least probable and 0 being the most probable.

 

An Alternate (not as good) Path Through the Model for QPW:

The following alignment for QPW with the model is also possible. It is obvious using eyeball alignment that this is not as good and the calculated probability will show us that this is in fact so.

Q P W (comparison string)

Q W - (model)

The probability that this alignment fits the model is given by the product of the probabilities of each transition: 0.505 * 0.005 * 0.2 * 0.6 = 0.0003

The log probability would be -0.683 + (-5.298) + (-1.609) + (-0.511) = -8.101

Using either probabilities or log probabilities, this alignment is clearly worse since it generates lower values.

 

An Even Worse Path Through the Model for QPW:

- Q P W (comparison string)

Q W - - (model)

The probability that this alignment fits the model is given by the product of the probabilities of each transition: 0.2 * 0.005 * 0.2 * 0.4 * 0.6 = 0.00005

The log probability would be -1.609 + (-5.298) + (-1.609) + (-0.916) + (-0.511) = -9.943

 

An Even Worse Yet Path Through the Model for QPW:

- - Q P W (comparison string)

Q W - - - (model)

The probability that this alignment fits the model is given by the product of the probabilities of each transition: 0.2 * 0.4 * 0.1 * 0.4 * 0.4 * 0.6 = 0.0008

The log probability would be -1.609 + (-0.916) + (-2.303) + (-0.916) + (-0.916) + (-0.511)= -7.171

Even though the path through the model is clearly worse in this case than the previous case the score for this path is higher. The reason is that the probabilities in the model are just made up numbers. The problem is that the 0.4 probability on the insert loops is clearly too high. If the parameters of an HMM are poorly estimated, then the alignments it creates will also be poor.

 

An Alignment that can not be Produced for QPW:

The following alignment requires the model to "emit" three deletes. Since it is not possible to visit any given Dx state more than once, we can not have more deletes ("-" in the comparison string) than there are Dx states in the model.

- - - Q P W (comparison string)

Q W - - - - (model)

 

Boise State University College of Engineering

Boise State University Department of Electrical and Computer Engineering

This page created by Dr. Scott F. Smith

This page was last updated on 14 February 2004.