Statistics in Bioinformatics

May 2, 2002

Quiz: 15 min

Learning objectives: understand equally likely outcomes, counting techniques (examples: genetic code, PCR primer), and Bayes’ Theorem

Homework 10

Statistical analysis of results underlies bioinformatics

When you run a program the computer will always give an answer.

The bioinformaticist will analyze the data from two points of view:
1) Statistical
2) Biological

Assessment through these filters will determine if the result is reasonable.

Two big questions

1. Does the result fit with what is currently known about biology (protein structure, evolution, function, etc.)?

2. Could the results have been obtained by random chance? Part of this comes from scientific intuition but another part comes from statistics.

Types of statistics used in bioinformatics

Yes: likelihood methods
No: ANOVA, regression analysis, hypothesis testing

When one performs a sequence comparison search, one must ask how likely it is that a match would be obtained by random chance. This depends on the sequence you are searching for and the amount of data within the database you are mining.

Equally likely outcomes

Sample space $S$ = set of all possible outcomes.

Assumption: all outcomes are equally likely. Then, for any event $A$ (a set of outcomes),

$$P(A) = \frac{\text{number of elements in } A}{\text{number of elements in } S} = \frac{|A|}{|S|}$$

For an experiment consisting of $k$ parts, each of which can have $n_i$ outcomes,

$$|S| = n_1 n_2 \cdots n_k$$

Multiplication Rule

$n$ things taken $k$ at a time with repetition is $n^k$.

Familiar example: the genetic code. Given that there are 4 nucleotides (A, T, G, C), how many different triplet codons are possible?

This is the same as saying 4 items taken 3 at a time with repetition.

Answer: $4 \times 4 \times 4 = 4^3 = 64$ (one factor of 4 for each of positions 1, 2, and 3).
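The count can be checked by enumerating the sample space directly. The sketch below is illustrative only: it lists every triplet over the four nucleotides and then applies the equally-likely-outcomes formula to one event, "the codon is a stop codon". The stop codons TAA, TAG, and TGA come from the standard genetic code and are not on the slide, and the variable names are made up for this example.

```python
from itertools import product

NUCLEOTIDES = "ATGC"

# Sample space S: every ordered triplet, repetition allowed (4 items taken 3 at a time).
codons = ["".join(triplet) for triplet in product(NUCLEOTIDES, repeat=3)]

# Event A: the codon is a stop codon (standard genetic code; an assumption
# added for illustration, not stated on the slide).
stop_codons = {"TAA", "TAG", "TGA"}
event = [c for c in codons if c in stop_codons]

# With equally likely outcomes, P(A) = |A| / |S|.
p_stop = len(event) / len(codons)

print(len(codons))  # size of the sample space, 4**3
print(p_stop)       # 3 / 64
```

Real coding sequences do not use codons with equal frequency, so the 3/64 figure describes a uniformly random triplet, not a biological estimate.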

Multiplication rule

$n$ things taken $k$ at a time with repetition is $n^k$.

Second example: PCR primer design. How many different PCR primers are possible of 16 nucleotides in length?

This is the same as saying 4 items taken 16 at a time with repetition.

Answer: $4^{16} \approx 4.29 \times 10^9$ (one factor of 4 for each of positions 1 through 16).

Any 16mer pattern can be expected to appear approximately once in the human genome by chance alone, because the human genome contains about $3 \times 10^9$ bases: the expected number of chance occurrences is roughly $(3 \times 10^9) / (4.29 \times 10^9) \approx 0.7$.
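The arithmetic behind this estimate can be written out as a short sketch. It assumes that all four bases are equally likely and independent at each position, that only one strand is searched, and that the number of possible start positions is approximately the genome size. The genome size is the slide's figure; the constant names are illustrative.

```python
ALPHABET_SIZE = 4    # A, T, G, C
PRIMER_LENGTH = 16   # nucleotides
GENOME_SIZE = 3e9    # bases, the slide's figure for the human genome

# Multiplication rule: n things taken k at a time with repetition is n**k.
n_primers = ALPHABET_SIZE ** PRIMER_LENGTH

# Each start position matches a given 16mer with probability 1 / n_primers,
# so the expected number of chance matches is (positions) / n_primers.
expected_hits = GENOME_SIZE / n_primers

print(n_primers)                 # 4294967296
print(round(expected_hits, 2))   # 0.7
```

Real genomes are not uniform random sequence (repeats and base-composition bias both matter), so this is a back-of-the-envelope figure and not a specificity guarantee for any actual primer.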

The expectation (E) value

The Expect value (E) is a parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. It decreases exponentially with the Similarity Score (S) that is assigned to a match between two sequences. The higher the score, the lower the E value. Essentially, the E value describes the random background noise that exists for matches between sequences. The Expect value is used as a convenient way to create a significance threshold for reporting results. When the Expect value is increased from the default value of 10, a larger list with more low-scoring hits can be reported. An E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size you might expect to see 1 match with a similar score simply by chance.
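The slide does not give a formula for E. One widely used form that has the behavior described here is the Karlin-Altschul relation, $E = K m n e^{-\lambda S}$, where $m$ is the query length and $n$ the database length. The sketch below uses that relation as an assumption; the values of `k` and `lam` are placeholders chosen for illustration, not parameters of any particular scoring system, and the function name is hypothetical.

```python
import math

def expect_value(score, query_len, db_len, k=0.1, lam=0.3):
    """Expected number of chance hits with at least this score.

    Uses the Karlin-Altschul form E = K * m * n * exp(-lambda * S).
    k and lam are illustrative placeholders, not real scoring-system values.
    """
    return k * query_len * db_len * math.exp(-lam * score)

# E falls exponentially as the score rises, and scales linearly with database size.
for s in (40, 60, 80):
    print(s, expect_value(s, query_len=300, db_len=1_000_000))
```

Doubling `db_len` doubles E for the same score, which is why the same alignment can look less significant when it is searched against a larger database.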

Bayesian Analysis

A method for including additional information from previous experience with similar data to calculate probabilities.

Useful for bioinformatics because as you perform an analysis you build up experience that will let you predict future outcomes.

For example, if you think that you are in an alpha helix what is the probability that the next amino acid is:   Alanine?

Proline?

Bayes’ Theorem

$$P(X \mid Y) = \frac{P(Y \mid X) \, P(X)}{P(Y)}$$

$P(X \mid Y)$ is the probability of $X$ occurring given ($\mid$) condition $Y$.

[Slide annotations: terms in the formula are marked "Experience" or "Calculated". Diagram labels: actual parameters (hidden); estimated parameters (from formula); observed results (from which we derive information); new probability.]

Bayes’ Theorem Example

You can predict whether a protein is a DNA binding protein based on its amino acid sequence.

$P(\text{DNA binder} \mid \text{sequence})$

What information do you require to calculate this?

$$P(\text{DNA binder} \mid \text{sequence}) = \frac{P(\text{sequence} \mid \text{DNA binder}) \, P(\text{DNA binder})}{P(\text{sequence})}$$

Bioinformatics experimental method:
1) Go to SwissProt
2) Obtain sequences of all of the known DNA binding proteins
3) Obtain sequences of all of the other proteins
4) Calculate probabilities

What are the sources of error for this approach?

How to compute relevant probabilities?

1) Obtain all sequences of known DNA binders. Check for the particular aa sequence and compute its percentage.

$$P(\text{aa sequence} \mid \text{DNA binder}) = \frac{\text{\# of known DNA binders containing the given sequence}}{\text{\# of known DNA binders}}$$

2) Compute the percentage of proteins that contain the sequence among all proteins in the database.

$$P(\text{sequence}) = \frac{\text{\# of proteins containing sequence}}{\text{\# of proteins}}$$

3) From the currently known part of the genome, $P(\text{DNA binder}) = 13.5\%$.
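The three quantities above combine through Bayes' Theorem as in the following sketch. All counts in the usage example are invented for illustration and are not SwissProt data; only the 13.5% prior comes from the slide. The function name and its parameters are hypothetical.

```python
def posterior_dna_binder(n_binders_with_seq, n_binders,
                         n_proteins_with_seq, n_proteins,
                         prior=0.135):
    """P(DNA binder | sequence) from database counts via Bayes' Theorem.

    prior is P(DNA binder); 0.135 is the slide's figure.
    """
    likelihood = n_binders_with_seq / n_binders    # P(sequence | DNA binder)
    evidence = n_proteins_with_seq / n_proteins    # P(sequence)
    return likelihood * prior / evidence

# Hypothetical counts, chosen so that binders are 13.5% of the database.
p = posterior_dna_binder(
    n_binders_with_seq=270, n_binders=1350,
    n_proteins_with_seq=500, n_proteins=10000,
)
print(round(p, 2))  # 0.54
```

With these made-up counts the answer agrees with the direct ratio 270/500, as it should when the prior and the counts come from the same database. If the prior is taken from a different source than the counts, the three estimates can be mutually inconsistent, which is one of the sources of error the earlier slide asks about.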