Pharmacogenomics

Pharmacogenomics and
Bioinformatics
M. Saleet Jafri
What is pharmacogenomics?
• Pharmacogenomics is the use of genomic and sequence
data of hosts and pathogens to identify potential drug
targets
• Involves a variety of techniques/disciplines such as
sequence analysis, protein structure, genomics,
microarray analysis and others
• These fields rely heavily on bioinformatics
• Usually focuses on medical or agricultural applications
Human Genome Project
Project goals are to
• identify all the approximately 20,000-25,000 genes in
human DNA,
• determine the sequences of the 3 billion chemical base
pairs that make up human DNA,
• store this information in databases,
• improve tools for data analysis,
• transfer related technologies to the private sector, and
• address the ethical, legal, and social issues (ELSI) that
may arise from the project.
From http://www.ornl.gov/hgmis/
Human Genome Project
Progress
- Several types of genome maps have already been
completed, and a working draft of the entire human
genome sequence was announced in June 2000, with
analyses published in February 2001.
- An important feature of this project is the federal
government's long-standing dedication to the transfer of
technology to the private sector. By licensing
technologies to private companies and awarding grants
for innovative research, the project is catalyzing the
multibillion-dollar U.S. biotechnology industry and
fostering the development of new medical applications.
From http://www.ornl.gov/hgmis/
Human Genome Project
• Seven organisms were originally chosen for sequencing.
– E. coli
– Yeast
– Fly
– Worm
– Arabidopsis
– Mouse
– Human
• Why were these chosen?
Genome Projects
As of January 2005 there were many more sequenced
– 25 non-plant eukaryotes
– 5 plants
– 213 microbes completed
– 21 Archaea
– 274 microbes in progress
– 1431 viruses in progress
– 833 non-virus organisms with at least one nucleotide
sequence submitted
• Why were these chosen?
Genome Projects
• Chosen by funding agencies
• Four main categories
– Medical applications
– Evolutionary significance
– Environmental impact
– Food production
How are genomics used for drug
target identification?
• The basic idea is to look for genes unique to the
pathogen that are crucial for its survival. This would be
the drug target.
• If this is a pathogen in the host, the gene would be in the
pathogen and not in the host.
• If this was in the environment, the gene should be as
specific as possible for the pathogen to avoid harming
other organisms that might be beneficial.
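The filtering idea above can be sketched with plain set operations. This is only an illustration, assuming genes have already been grouped into families shared across species; the family identifiers and the `essential` set below are made-up placeholders, not real annotations.

```python
def candidate_targets(pathogen_genes, host_genes, essential_genes):
    """Return pathogen gene families that are essential and absent from the host."""
    unique_to_pathogen = set(pathogen_genes) - set(host_genes)
    return sorted(unique_to_pathogen & set(essential_genes))

# Hypothetical gene-family identifiers, for illustration only.
pathogen = {"famA", "famB", "famC", "famD"}
host = {"famB", "famD", "famE"}
essential = {"famA", "famB"}

targets = candidate_targets(pathogen, host, essential)
print(targets)
```

In practice the hard part is deciding which genes count as "the same" in two genomes, which is where the sequence comparison tools come in.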
How can this be done?
• To do this genomics, proteomics and bioinformatics are
involved.
• In any of these cases bioinformatics tools are necessary.
Genome Sequencing and
Comparison
• As mentioned earlier, many pathogens (viruses, bacteria,
and other microorganisms) have been sequenced.
• Once they are sequenced, they are annotated.
Annotation is the process by which the functions of the
different genes (and the proteins they encode) are determined.
• In this way, an understanding of the organism's
metabolism is gained.
Malaria
• Malaria is caused by parasites of the genus Plasmodium, with
Plasmodium falciparum being the most lethal.
• Its genome has been sequenced.
• It is a pathogen that digests proteins for food. It does not
contain any amino acid producing genes in its genome,
i.e. it does not make its own amino acids.
• Purines are recycled, but there are no genes for purine
synthesis.
• Has many ATP-dependent solute transporters and one
novel multifunctional transporter.
How is annotation done?
• Annotation is the process of predicting the function of
genes in a genome.
• First all the genes have to be found. This is done by
finding the open reading frames (ORFs).
• ORFs are located by gene finding or gene prediction software.
Gene Prediction
• Analysis by sequence similarity can only reliably identify
about 30% of the protein-coding genes in a genome
• 50-80% of new genes identified have a partial, marginal,
or unidentified homolog
• Frequently expressed genes tend to be more easily
identifiable by homology than rarely expressed genes
Gene Finding
• Process of identifying potential coding regions in an
uncharacterized region of the genome
• Still a subject of active research
• There are many different gene finding software
packages and no one program is capable of finding
everything
Eukaryotes vs Prokaryotes
• Eukaryotic DNA is wrapped around histones, which might
result in repeated patterns (periodicity of 10) for
histone binding. The promoter regions might be near
these sites so that they remain hidden.
• Prokaryotes have no introns.
• Promoter regions and start sites are more highly
conserved in prokaryotes
• Different codon use frequencies
Gene finding is species-specific
• Codon usage patterns vary by species
• Functional regions (promoters, splice
sites, translation initiation sites,
termination signals) vary by species
• Common repeat sequences are species-specific
• Gene finding programs rely on this
information to identify coding regions
The genetic code
Codon usage
Identifying ORFs
• Simple first step in gene finding
• Translate genomic sequence in six frames. Identify stop
codons in each frame
• Regions without stop codons are called "open reading
frames" or ORFs
• Locate and tag all of the likely ORFs in a sequence
• The longest ORF from a Met codon is a good prediction
of a protein encoding sequence.
• SOFTWARE: NCBI ORF Finder
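The six-frame scan described above can be sketched in a few lines. This is a minimal illustration, not how NCBI ORF Finder is implemented: it assumes the standard stop codons, treats ATG as the only start codon, and reports reverse-strand coordinates relative to the reverse-complemented sequence. The function names and the `min_length` parameter are choices made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_frame(seq, frame):
    """Yield (start, end) for each ATG...stop stretch in one reading frame.

    end is exclusive and includes the stop codon. Taking the first ATG
    after the previous stop gives the longest ORF from a Met codon.
    """
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            yield start, i + 3
            start = None

def find_orfs(seq, min_length=30):
    """Scan all six frames; return (strand, frame, start, end, orf), longest first."""
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame):
                if end - start >= min_length:
                    results.append((strand, frame, start, end, s[start:end]))
    return sorted(results, key=lambda r: r[3] - r[2], reverse=True)

orfs = find_orfs("CCATGAAATTTGGGTAACC", min_length=9)
print(orfs)
```

A realistic `min_length` would be much larger than the toy value used here, since short ORFs occur frequently by chance.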
ORF Finder input
ORF finder results
Tests of the Predicted ORF
• Check if the third base in the codons tends to be the
same one more often than by chance alone.
• Are the codons used in the ORF the same as those
used in other genes (need codon usage frequency).
• Compare the amino acid sequence for similarity with
other known amino acid sequences.
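The first two tests rely on simple counts over the codons of the candidate ORF. The sketch below shows one way to compute them; the function names are illustrative, and comparing the result against a reference codon usage table for the species is left out.

```python
from collections import Counter

def codons(orf):
    """Split an ORF into complete codons, dropping any trailing partial codon."""
    usable = len(orf) - len(orf) % 3
    return [orf[i:i + 3] for i in range(0, usable, 3)]

def codon_usage(orf):
    """Relative frequency of each codon in the ORF."""
    counts = Counter(codons(orf))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

def third_base_composition(orf):
    """Relative frequency of each base at the third codon position."""
    counts = Counter(codon[2] for codon in codons(orf))
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}

orf = "ATGGCCGCCGACTAA"
print(codon_usage(orf))
print(third_base_composition(orf))
```

A third-position composition far from what the overall base composition would predict, or a codon usage profile close to that of known genes, both count in favor of the ORF being a real gene.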
Problems with ORF finding
• A single-character sequencing error can hide a stop
codon or insert a false stop codon, preventing
accurate identification of ORFs
• Short exons can be overlooked
• Multiple transcripts or ORFs on complementary
strand can confuse results
Pattern-based gene finding
• ORF finding based on start and stop codon
frequency is a pattern-based procedure
• Other pattern-based procedures recognize
characteristic sequences associated with known
features and genes, such as ribosome binding
sites, promoter sites, histone binding sites, etc.
• Statistically based.
Content-based gene finding
• Content-based gene finding methods rely on
statistical information derived from known sequences
to predict unknown genes
• Some evaluative measures include: "coding
potential" (based on codon bias), periodicity in the
sequence, sequence homogeneity, etc.
A standard content-based
alignment procedure
• Select a window of DNA sequence from the unknown.
The window is usually around 100 base pairs long
• Evaluate the window's potential as a gene, based on a
variety of factors
• Move the window over by one base
• Repeat procedure until end of sequence is reached;
report continuous high-scoring regions as putative
genes
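The window procedure above can be sketched as follows. The scoring function here is a deliberate stand-in: GC fraction is used only because it is easy to compute, whereas a real gene finder would score coding potential, periodicity, and so on. The function names, the threshold, and the interval convention are assumptions of this example.

```python
def gc_fraction(window):
    """Stand-in score: fraction of G/C bases (not a real coding measure)."""
    return sum(base in "GC" for base in window) / len(window)

def high_scoring_regions(seq, window=100, threshold=0.6, score=gc_fraction):
    """Slide a window one base at a time and merge consecutive
    high-scoring windows into (start, end) intervals, end exclusive."""
    regions = []
    current_start = None
    last_end = None
    for i in range(len(seq) - window + 1):
        if score(seq[i:i + window]) >= threshold:
            if current_start is None:
                current_start = i
            last_end = i + window
        elif current_start is not None:
            regions.append((current_start, last_end))
            current_start = None
    if current_start is not None:
        regions.append((current_start, last_end))
    return regions

toy = "AT" * 10 + "GC" * 10 + "AT" * 10
regions = high_scoring_regions(toy, window=10, threshold=0.8)
print(regions)
```

In the toy sequence the GC-rich block spans positions 20 to 40, but the reported region is wider, because any window that overlaps the block enough passes the threshold. That is the boundary uncertainty inherent in window-based evaluation.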
Combining measures
• Programs rarely use one measure to predict genes
• Different values are combined (using probabilistic
methods, discriminant analysis, neural net methods, etc.)
to produce one "score" for the entire window
Drawbacks to window-based
evaluation
• A sequence length of at least 100 b.p. is required
before significant information can be gained from the
analysis
• Results in a +/- 100 b.p. uncertainty in the start site
of predicted coding regions, unless an unambiguous
pattern can also be found to indicate the start.
Most are web-based, but...
• Submit sequence; input sequence length may
be limited
• Select parameters, if any
• Interpret results
• Most software is first or second generation;
results come in non-graphical formats.
• GeneMark, GenScan, Glimmer
How is annotation done?
• This is done by comparing the DNA sequences of the
genes to known genes in a database. If the sequences
are similar, then a similar function is assumed.
• The comparison is done using sequence comparison
tools such as BLAST
Database Searching for Similar
Sequences
• Database searching for similar sequences is ubiquitous
in bioinformatics.
• Databases are large and getting larger
• Need fast methods
Types of Searches
• Sequence similarity search with query sequence
• Alignment search with profile (scoring matrix with gap
penalties)
• Search with position-specific scoring matrix representing
ungapped sequence alignment
• Iterative alignment search for similar sequences that
starts with a query sequence, builds a multiple alignment,
and then uses the alignment to augment the search
• Search query sequence for patterns representative of
protein families
From Bioinformatics by Mount
DNA vs Protein Searches
• DNA sequences consist of 4 characters (nucleotides)
• Protein sequences consist of 20 characters (amino acids)
• Hence, it is easier to detect patterns in protein sequences
than DNA sequences
• Better to convert DNA sequences to protein sequences
for searches.
Database Searching Efficacy
• To evaluate searching methods, selectivity and
sensitivity need to be considered.
• Selectivity is the ability of the method not to find
members known to be of another group (i.e. false
positives).
• Sensitivity is the ability of the method to find members
of the same protein family as the query sequence.
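Both quantities can be computed once the true family membership is known for a test database. In this sketch, sensitivity is the fraction of true family members that the search found, and selectivity is taken to be the fraction of reported hits that really belong to the family; that second formula is one common reading of the definition and is an assumption here. The sequence identifiers are made up.

```python
def search_metrics(hits, family):
    """Return (sensitivity, selectivity) for a set of search hits."""
    hits, family = set(hits), set(family)
    true_pos = len(hits & family)    # family members that were found
    false_pos = len(hits - family)   # hits that belong to another group
    false_neg = len(family - hits)   # family members that were missed
    sensitivity = true_pos / (true_pos + false_neg) if family else 0.0
    selectivity = true_pos / (true_pos + false_pos) if hits else 0.0
    return sensitivity, selectivity

hits = {"p1", "p2", "p3", "x1"}
family = {"p1", "p2", "p3", "p4", "p5"}
sensitivity, selectivity = search_metrics(hits, family)
print(sensitivity, selectivity)
```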
Protein Searches
• Easier to identify protein families by sequence similarity
rather than structural similarity. (same structure does
not mean same sequence)
• Use the appropriate gap penalty scorings
• Evaluate results for statistical significance.
History
• Historically dynamic programming was used for
database sequence similarity searching.
• Computer memory, disk space, and CPU speed were
limiting factors.
• Speed is still a factor due to the larger databases and
increased number of searches.
• FASTA and BLAST allow fast searching.
History
• The PAM250 matrix was used for a long time. It
corresponds to a period of time where only 20% of the
amino acids have remained unchanged.
• BLOSUM has replaced PAM250 in most applications.
BLAST uses the BLOSUM62 matrix by default. FASTA uses the
BLOSUM50 matrix by default.
Search Tools
• Similarity Search Tools
– Smith-Waterman Searching
• Heuristic Search Tools
– FASTA
– BLAST
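For reference, the dynamic programming behind Smith-Waterman searching can be sketched compactly. This version only returns the best local alignment score, uses a simple match/mismatch scheme in place of a substitution matrix such as BLOSUM62, and applies a linear gap penalty; the parameter values are arbitrary choices for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, which is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

The running time grows with the product of the two sequence lengths, which is why the heuristic tools trade some sensitivity for speed on large databases.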
Malaria Vaccine
• A German and American team used reverse genetics,
i.e. they used the sequenced genome, deduced the
candidate genes, and then knocked out a particular gene
(Uis3).
• This gives 30-day immunity in mice, which is better than
vaccines made by traditional methods
Microarray Data Analysis
Gene chips allow the simultaneous monitoring of the
expression level of thousands of genes. Many statistical
and computational methods are used to analyze this data.
These include:
– statistical hypothesis tests for differential expression
analysis
– principal component analysis and other methods for
visualizing high-dimensional microarray data
– cluster analysis for grouping together genes or samples
with similar expression patterns
– hidden Markov models, neural networks and other
classifiers for predictively classifying sample expression
patterns as one of several types (e.g. diseased, i.e.
cancerous, vs. normal)
What is Microarray Data?
In spite of their ability to let us simultaneously monitor
the expression of thousands of genes, there are some
liabilities with microarray data. Each microarray is very
expensive, the statistical reproducibility of the data is
relatively poor, and there are a lot of genes and complex
interactions in the genome.
Microarray data is often arranged in an n x m matrix M with
rows for the n genes and columns for the m biological
samples in which gene expression has been monitored.
Hence, m_ij is the expression level of gene i in sample j. A
row e_i is the gene expression pattern of gene i over all the
samples. A column s_j holds the expression levels of all genes in
sample j and is called the sample expression pattern.
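The matrix layout can be made concrete with a small example. The numbers are invented, and a plain list of lists stands in for whatever array type an analysis package would use.

```python
# Hypothetical expression values: 3 genes (rows) x 4 samples (columns).
M = [
    [2.0, 2.1, 0.5, 0.4],   # gene 0
    [1.0, 1.1, 1.0, 0.9],   # gene 1
    [0.2, 0.3, 3.0, 3.1],   # gene 2
]

def gene_pattern(matrix, i):
    """Row i: expression of gene i across all samples."""
    return list(matrix[i])

def sample_pattern(matrix, j):
    """Column j: expression of all genes in sample j."""
    return [row[j] for row in matrix]

print(M[0][2])               # expression level of gene 0 in sample 2
print(gene_pattern(M, 2))    # gene expression pattern of gene 2
print(sample_pattern(M, 0))  # sample expression pattern of sample 0
```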
Types of Microarrays
• cDNA microarray
• Nylon membrane and plastic arrays (by Clontech)
• Oligonucleotide silicon chips (by Affymetrix)
• Note: Each new version of a microarray chip is at
least slightly different from the previous version. This
means that the measures are likely to change. This
has to be taken into account when analyzing data.
cDNA Microarray
• The expression level e_ij of a gene i in sample j is
expressed as a log ratio, log(r_ij/g_i), of its
actual expression level r_ij in this sample over its
expression level g_i in a control.
• When this data is visualized e_ij is color coded as
red (r_ij >> g_i), green (r_ij << g_i), or a
mixture in between.
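A short numeric example shows why the log ratio is convenient: a fourfold increase and a fourfold decrease give values of equal size and opposite sign. The slide does not state the base of the logarithm; base 2 is assumed here, and the intensity values are invented.

```python
import math

def log_ratio(sample_level, control_level):
    """Log2 ratio of the expression level in a sample over that in the control."""
    return math.log2(sample_level / control_level)

up = log_ratio(800.0, 200.0)         # fourfold higher than control
down = log_ratio(200.0, 800.0)       # fourfold lower than control
unchanged = log_ratio(500.0, 500.0)  # same as control
print(up, down, unchanged)
```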
Nylon Membrane and Plastic Arrays (by
Clontech)
• A raw intensity and a background value are
measured for each gene.
• The analyst is free to choose the raw intensity or can
adjust it by subtracting the background intensity.
Oligonucleotide Silicon Chips (by
Affymetrix)
• These arrays produce a variety of numbers derived
from 16-20 pairs of perfect match (PM) and mismatch
(MM) probes.
• There are several statistics related to gene expression
that can be derived from this data. The most
commonly used one is the average difference (AVD),
which is derived from the differences of PM-MM in the
16-20 probe pairs.
• The next most commonly used method is the log
absolute value (LAV), which comes from the ratios
PM/MM in the probe pairs.
• Note: The Affymetrix gene-chip software has an
absent/present call for each gene on a chip.
According to Jagota, the method is complex and
arbitrary, so it is usually ignored.
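The two summary statistics can be illustrated on a toy probe set. The average difference follows directly from the description above. The ratio-based summary is only a guess at the spirit of the LAV measure (here, the mean of log10(PM/MM)), not the vendor's exact formula. Real probe sets have 16-20 pairs; three invented pairs are used for brevity.

```python
import math

def average_difference(pm, mm):
    """Mean of PM - MM over all probe pairs."""
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)

def mean_log_ratio(pm, mm):
    """Mean of log10(PM / MM) over all probe pairs (assumed form, see text)."""
    return sum(math.log10(p / m) for p, m in zip(pm, mm)) / len(pm)

pm = [1000.0, 100.0, 10.0]   # perfect match intensities (invented)
mm = [100.0, 10.0, 1.0]      # mismatch intensities (invented)
print(average_difference(pm, mm), mean_log_ratio(pm, mm))
```

The difference-based value is dominated by the brightest probe pair, while the ratio-based value treats all three pairs equally, which is one reason the two measures can rank genes differently.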
For What Do We Use Microarray Data?
• Genes with similar expression patterns over all
samples – We can compare the expression patterns
e_i and e_i' of two genes i and i' over all samples.
• If we use cluster analysis, we can separate the genes
into groups of genes with similar expression patterns
(trees).
• This will allow us to find what unknown genes have
altered expression in a particular disease by
comparing the pattern to genes known to be affiliated
with a disease.
• It can also find genes that fit a certain pattern such as
a particular pattern of change with time.
• It can also characterize broad functional classes of
new genes from the known classes of genes with
similar expression.
For What Do We Use Microarray Data?
• Genes with unusual expression levels in a sample –
In contrast to standard statistical methods where we
ignore outliers, here outliers might have particular
importance. Hence, we look for genes whose
expression levels are very different from the others.
• Genes whose expression levels vary across samples
– We can compare gene expression levels of a
particular gene or set of genes in different samples.
This can be used to compare normal and
diseased tissues or diseased tissue before and after
treatment.
For What Do We Use Microarray Data?
• Samples that have similar expression patterns – We
might want to compare the expression patterns of all
genes between two samples. We might cluster the
genes into groups of genes with similar expression patterns to
help with the comparison. This can be used to
compare normal and diseased tissues or diseased
tissue before and after treatment.
• Tissues that might be cancerous (diseased) – We
can take the gene expression pattern of a sample and
compare it to library expression patterns that indicate
diseased or not diseased tissue.
Statistical Methods Can Help
• Experimental Design – Since using microarrays is
costly and time consuming, we want to design
experiments to use the minimal number of
microarrays that will give a statistically significant
result.
• Data Pre-processing – It is sometimes useful to
preprocess the data prior to visualization. An
example of this is the log ratio mentioned earlier. It is
often necessary to rescale data from different
microarrays so that they can be compared. This is
due to variation in chip to chip intensity. Another
type of preprocessing is subtracting the mean and
dividing by the standard deviation.
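The mean-and-spread rescaling can be sketched as below. This version divides by the standard deviation, the usual convention for standardizing a pattern, and uses the population form of the standard deviation; both choices are assumptions of the example, as are the values.

```python
import statistics

def standardize(pattern):
    """Zero-center an expression pattern and scale it to unit spread."""
    mean = statistics.mean(pattern)
    sd = statistics.pstdev(pattern)   # population standard deviation
    if sd == 0:
        return [0.0 for _ in pattern]  # a flat pattern carries no shape information
    return [(x - mean) / sd for x in pattern]

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = standardize(values)
print(scaled)
```

After this step two genes with the same shape of response but different absolute levels have the same pattern, which is often what a clustering step should see.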
Statistical Methods Can Help
• Data Visualization – Principal component analysis
and multidimensional scaling are two useful
techniques for reducing multidimensional data to two
and three dimensions. This allows us to visualize it.
• Cluster Analysis – By associating genes with similar
expression patterns, we might be able to draw
conclusions about their functional expression.
• Probability Theory – We can use statistical modeling
and inference to analyze our data. Probability theory
is the basis for these.
Statistical Methods Can Help
• Statistical Inference – This is the formulation and
statistical testing of a hypothesis and alternative
hypothesis.
• Classifiers for the Data – We can construct classes
from data, such as diseased vs. non-diseased tissue.
We can build a model (such as a hidden Markov
model) that fits known data for the different classes.
This can then be used to classify previously
unclassified data.
Preprocessing Microarray Data
• Before microarray data can be analyzed or stored, a
number of procedures or transformations must be
applied to it.
• In order to analyze the data correctly, it is important to
understand what the transformations might be doing
to the data.
Preprocessing Microarray Data
• Ratioing the data
• Log-transforming ratioed data
• Alternative to ratioing the data
• Differencing the data
• Scaling data across chips to account for chip-to-chip
difference
• Zero-centering a gene or a sample expression pattern
• Weighting the components of a gene or sample
expression pattern differently
• Handling missing data
• Variation filtering expression patterns
• Discretizing expression data
Cluster Analysis of Microarray Data
• Recall that microarray data can be thought of as gene
expression patterns or sample expression patterns.
These can be each considered to be vectors. The first
thing we have to do before applying cluster analysis is to
find a distance between the various expression pattern
vectors. This is done using similarity/dissimilarity
measures such as Euclidean distance, Mahalanobis
distance, or linear correlation coefficients. Once a
distance matrix is computed, the following clustering
algorithms can be used. The clusters formed can differ
significantly depending upon the distance measure used.
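Two of the measures named above can be written out directly. The example patterns are invented and chosen so that the two measures disagree: the second gene has ten times the level of the first but exactly the same shape. Using 1 minus the correlation as a distance is one common convention and is an assumption here.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two expression pattern vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pearson(u, v):
    """Linear (Pearson) correlation coefficient between two vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    spread_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    spread_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (spread_u * spread_v)

def correlation_distance(u, v):
    """Turn a correlation into a dissimilarity: 0 means perfectly correlated."""
    return 1.0 - pearson(u, v)

e1 = [1.0, 2.0, 3.0, 4.0]
e2 = [10.0, 20.0, 30.0, 40.0]
print(euclidean(e1, e2), correlation_distance(e1, e2))
```

By Euclidean distance the two genes are far apart; by correlation distance they are essentially identical, so they would land in different clusters under one measure and the same cluster under the other.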
Cluster Analysis of Microarray Data
• Hierarchical Clustering – Assume each data point is in a
singleton cluster.
– Find the two clusters that are closest together.
Combine these to form a new cluster.
– Compute the distance from all clusters to the new
cluster using some form of averaging.
– Find the two closest clusters and repeat.
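The merge loop can be sketched as follows. This is a deliberately simple version: it uses Euclidean distance, average linkage as the "form of averaging", and recomputes all cluster distances on every pass, which is far slower than what real packages do. The function names and the `n_clusters` stopping rule are choices made for this example.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def average_linkage(c1, c2, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def hierarchical_clustering(points, n_clusters=1):
    """Agglomerative clustering; returns (clusters, merge_history).

    Clusters are lists of point indices. Each history entry records the
    two clusters merged and the distance at which they were merged.
    """
    clusters = [[i] for i in range(len(points))]   # start with singletons
    history = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        history.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters, history

points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
clusters, history = hierarchical_clustering(points, n_clusters=2)
print(clusters)
```

Running the loop down to a single cluster and keeping the merge history gives the tree that is usually drawn next to a clustered expression heat map.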
Cluster Analysis of Microarray Data
• k-Means Clustering – An alternative method of clustering,
called k-means clustering, partitions the data into k
clusters and finds a cluster mean μ_i for each cluster. In
our case, the means will be vectors also. Usually, the
number of clusters k is fixed in advance. To choose k
something must be known about the data. There might
be a range of possible k values. To decide which is best,
we optimize a quantity that measures cluster
tightness, i.e. we minimize the distances between points in a
cluster.
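A bare-bones version of the procedure is sketched below. It assumes Euclidean distance, takes the starting means as an explicit argument so the result is reproducible (random initialization is more usual), and reports the within-cluster sum of squares as the tightness measure. These are choices made for the example.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def k_means(points, initial_means, max_iter=100):
    """Alternate between assigning points to the nearest mean and
    recomputing the means until the assignment stops changing."""
    means = [list(m) for m in initial_means]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(means)), key=lambda c: euclidean(p, means[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for c in range(len(means)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:               # keep the old mean if a cluster is empty
                means[c] = mean_vector(members)
    return assignment, means

def within_cluster_sum_of_squares(points, assignment, means):
    """Tightness measure: smaller means tighter clusters."""
    return sum(euclidean(p, means[a]) ** 2 for p, a in zip(points, assignment))

points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
assignment, means = k_means(points, initial_means=[[0.0, 0.0], [5.0, 5.0]])
tightness = within_cluster_sum_of_squares(points, assignment, means)
print(assignment, means, tightness)
```

To compare candidate values of k, the same tightness measure can be computed for each; it always shrinks as k grows, so the usual approach is to look for the point where adding clusters stops helping much.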
Cluster Analysis of Microarray Data
• Self-organizing Maps – This is basically an application of
neural networks to microarray data. Assume that there
is a 2-dimensional grid of cells and a map from a given
set of expression data vectors in R^n, i.e. there are n
nodes in the input layer and a connection from
each of these to each cell. Each cell (i, j) gets its own
weight vector from the n input neurons. The weight vector m_ij is
the mean of the cluster associated with cell (i, j). Each
data vector d gets mapped to the cell (i, j) whose weight vector is closest
to d using Euclidean distance. In order to train the
network, the mean vectors m_ij for the cells (i, j) must be
learned.
Sample Microarray
Correlations
Clustering of Genes
Personalized Medicine
• Personalized medicine is a relatively new buzzword.
• The idea is to develop medicines and treatment plans
based on an individual's genetic make-up.
Proteomics
• Understanding protein function
• Functional genomics
• Multiple approaches – structure, expression levels,
biochemistry, modeling etc.
• Combining technologies is necessary to understand in
vivo protein function
Approach
• Use data to determine pathway.
• Use biochemistry to figure out kinetics and
concentrations.
• Use new proteomic approaches to determine relative
concentrations.
• Apply pathway model to determine functional
consequence.
Pathway Data
• Using molecular biological techniques we can
determine what proteins make up a biochemical
pathway.
[Figure: pathway diagram with nodes A, B, C, and D]
Pathways
• Biochemical Pathways form complex biochemical
reaction networks.
• There might be multiple ways to get from A to B.
• The path chosen depends on biochemical kinetics.
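The point about multiple routes can be illustrated by treating the reaction network as a directed graph and listing every route between two species. The network below is hypothetical; it reuses the node names A to D but its connections are invented for the example.

```python
def all_paths(network, start, goal, path=None):
    """Enumerate cycle-free routes from start to goal by depth-first search."""
    path = (path or []) + [start]
    if start == goal:
        return [path]
    routes = []
    for nxt in network.get(start, []):
        if nxt not in path:           # do not revisit a species already on the route
            routes.extend(all_paths(network, nxt, goal, path))
    return routes

# Hypothetical network: A converts directly to B, or indirectly via C and D.
network = {"A": ["B", "C"], "C": ["D"], "D": ["B"]}
routes = all_paths(network, "A", "B")
print(routes)
```

The graph only says which routes exist. Which one carries most of the flux depends on the rate constants and concentrations, which is what the kinetic measurements supply.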
Biochemistry
• Classical biochemistry isolates proteins from tissue or
cells.
• Modern molecular biology allows the production of
purified protein.
• The concentration of the protein is determined
• The kinetic properties of the proteins are determined by
biochemical assay – rates of reactions, modulating
factors, etc.
Pathway Modeling Methods
• Boolean Models
• Metabolic Control Theory – Flux Balance Analysis
• Biochemical Systems Analysis
• Kinetic Modeling Approach
Disorders of Thrombophilia
• The functional consequences of
nonsynonymous SNPs can be predicted by
comparison of protein structures.
• There are various SNPs known
– Activated protein C resistance by Arg 506 to Gln
– Prothrombin polymorphism (G20210A) causing
elevated prothrombin levels
– Protein C deficiency
– Protein S deficiency
– Antithrombin deficiency
– Elevated factor VIII levels
Fibrinogen Abnormalities
• Various polymorphisms found in the long
arm of chromosome 4
• Two dimorphisms of the β-chain gene are
of major importance and in linkage
disequilibrium with each other.
• These affect plasma fibrinogen levels
Prothrombin G20210A
Polymorphism
• Replacement of a G by A at nucleotide
20210 in the untranslated section of the
prothrombin gene increases translation
without altering transcription of the gene.
• This results in elevated synthesis and
secretion of prothrombin by the liver.
• This results in increased thrombin levels
Activated protein C resistance
• Factor V Leiden R506Q mutation occurs in
8% of the population.
• It is a GA substitution at nucleotide 1691
in the gene for factor V.
• Factor V is cleaved less efficiently by
activated protein C
• Results in deep vein thrombosis, early
kidney transplant loss, recurrent
miscarriages and other disorders
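How a single base change produces the R506Q amino acid change can be shown with a two-entry excerpt of the genetic code. The sketch assumes the affected codon is CGA, as commonly reported for this variant, and that the G to A change falls on its middle base; both are stated here as assumptions rather than taken from the slide.

```python
# Small excerpt of the standard genetic code; only the codons needed here.
CODON_TO_AMINO_ACID = {"CGA": "Arg", "CAA": "Gln"}

def point_substitution(codon, position, new_base):
    """Return the codon with one base (0-based position) replaced."""
    return codon[:position] + new_base + codon[position + 1:]

wild_type = "CGA"                                  # assumed codon 506
variant = point_substitution(wild_type, 1, "A")    # G -> A at the middle base
change = (CODON_TO_AMINO_ACID[wild_type], CODON_TO_AMINO_ACID[variant])
print(wild_type, "->", variant, ":", change[0], "->", change[1])
```

Arginine (R) becoming glutamine (Q) at position 506 is what the shorthand R506Q records.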