
Bio-Medical Informatics
Instructor : Hanif Yaghoobi
Website: site444703.44.webydo.com
E-mail : [email protected]
My personal Mail: [email protected]
About this Course
• Activities during the semester (5 points):
1) Homework
2) MATLAB exercises
• Final project (3 points)
• Final exam (12 points)
Shortliffe
“ Medical informatics is the rapidly
developing scientific field that deals
with resources, devices and formalized
methods for optimizing the storage,
retrieval and management of
biomedical information for problem
solving and decision making”
Edward Shortliffe, MD, PhD
1995
Organisms
• Classified into two types:
• Eukaryotes: contain a membrane-bound nucleus and
organelles (plants, animals, fungi,…)
• Prokaryotes: lack a true membrane-bound nucleus and
organelles (single-celled, includes bacteria)
• Not all single-celled organisms are prokaryotes!
Cells
• Complex system enclosed in a
membrane
• Organisms are unicellular
(bacteria, baker’s yeast) or
multicellular
• Humans:
– 60 trillion cells
– 320 cell types
Example Animal Cell
www.ebi.ac.uk/microarray/biology_intro.htm
DNA Basics – cont.
• DNA in Eukaryotes is organized in chromosomes.
Chromosomes
• In eukaryotes, the nucleus
contains one or several
double-stranded DNA
molecules organized as
chromosomes
• Humans:
– 22 Pairs of autosomes
– 1 pair sex chromosomes
Human Karyotype
http://avery.rutgers.edu/WSSP/StudentScholars/Session8/Session8.html
www.biotec.or.th/Genome/whatGenome.html
What is DNA?
• DNA: Deoxyribonucleic Acid
• Each strand is a chain of nucleotides (oligomer, polynucleotide)
• 4 different nucleotides:
– Adenine (A)
– Cytosine (C)
– Guanine (G)
– Thymine (T)
Nucleotide Bases
• Purines (A and G)
• Pyrimidines (C and T)
• Difference is in base structure
Image Source: www.ebi.ac.uk/microarray/biology_intro.htm
DNA
The Central Dogma: Protein Synthesis
[Figure: Genome -> (Transcription) -> Transcriptome (gene expression level) -> (Translation) -> Proteome -> Cell function]
Genome
• chromosomal DNA of an organism
• number of chromosomes and genome size varies
quite significantly from one organism to another
• Genome size and number of genes does not
necessarily determine organism complexity
Genome Comparison
ORGANISM                              CHROMOSOMES   GENOME SIZE (bp)   GENES
Homo sapiens (Humans)                 23            3,200,000,000      ~30,000
Mus musculus (Mouse)                  20            2,600,000,000      ~30,000
Drosophila melanogaster (Fruit Fly)   4             180,000,000        ~18,000
Saccharomyces cerevisiae (Yeast)      16            14,000,000         ~6,000
Zea mays (Corn)                       10            2,400,000,000      ???
DNA Basics – cont.
• The DNA in each
chromosome can be
read as a discrete signal
over the alphabet {a,t,c,g}
(for example:
atgatcccaaatggaca…)
DNA Basics – cont.
• In genes (protein-coding regions), when proteins are
built from amino acids, the nucleotides (letters)
are read as triplets (codons). Every codon signals
one amino acid for protein synthesis (there are
20 amino acids).
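The triplet reading described above can be sketched in a few lines of Python. This is a minimal illustration and not part of the original slides: the names split_codons and translate are hypothetical, and the table is the standard genetic code written compactly (codons enumerated with bases in the order T, C, A, G; "*" marks a stop codon).

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def split_codons(dna, offset=0):
    """Cut the sequence into complete triplets, starting at `offset`."""
    dna = dna.upper()
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]

def translate(dna, offset=0):
    """Map every complete codon to its one-letter amino acid code."""
    return "".join(CODON_TABLE[codon] for codon in split_codons(dna, offset))

# The example signal from the earlier slide; the incomplete last codon is dropped.
print(split_codons("atgatcccaaatggaca"))
print(translate("atgatcccaaatggaca"))
```

A trailing fragment shorter than three letters is simply ignored here; that is a design choice of this sketch.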
DNA Basics – cont.
• There are 6 ways of translating a DNA signal into a
codon signal, called the reading frames (3 offsets * 2
directions).
…CATTGCCAGT…
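The six reading frames can be listed explicitly. The sketch below is an illustration under the assumption that the second direction means reading the reverse-complement strand; the function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(dna):
    """Return the opposite strand, read in its own 5'->3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))

def reading_frames(dna):
    """Return the 6 codon lists: 3 offsets on each of the 2 strands."""
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames.append(codons)
    return frames

# The fragment shown on the slide.
for frame in reading_frames("CATTGCCAGT"):
    print(frame)
```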
DNA Basics – Cont.
…CATTGCCAGT…
Start: ATG
Stop: TAA, TGA, TAG
[Figure: a gene drawn as alternating exons and introns]
Understanding Genome Sequences
~3,289,000,000 characters:
aattgtgctctgcaaattatgatagtgatctgtatttactacgtgcatat
attttgggccagtgaatttttttctaagctaatatagttatttggacttt
tgacatgactttgtgtttaattaaaacaaaaaaagaaattgcagaagtgt
tgtaagcttgtaaaaaaattcaaacaatgcagacaaatgtgtctcgcagt
cttccactcagtatcatttttgtttgtaccttatcagaaatgtttctatg
tacaagtctttaaaatcatttcgaacttgctttgtccactgagtatatta
tggacatcttttcatggcaggacatatagatgtgttaatggcattaaaaa
taaaacaaaaaactgattcggccgggtacggtggctcacgcctgtaatcc
cagcactttgggagatcgaggagggaggatcacctgaggtcaggagttac
agacatggagaaaccccgtctctactaaaaatacaaaattagcctggcgt
ggtggcgcatgcctgtaatcccagctactcgggaggctgaggcaggagaa
tcgcttgaacccgggagcggaggttgcggtgagccgagatcgcaccgttg
cactccagcctgggcgacagagcgaaactgtctcaaacaaacaaacaaaa
aaacctgatacatggtatgggaagtacattgtttaaacaatgcatggaga
tttaggttgtttccagtttttactggcacagatacggcaatgaatataat
tttatgtatacattcatacaaatatatcggtggaaaattcctagaagtgg
aatggctgggtcagtgggcattcatattgagaaattggaaggatgttgtc
aaactctgcaaatcagagtattttagtcttaacctctcttcttcacaccc
ttttccttggaagaaagctaaatttagacttttaaacacaaaactccatt
ttgagacccctgaaaatctgggttcaaagtgtttgaaaattaaagcagag
gctttaatttgtacttatttaggtataatttgtactttaaagttgttcca
. . .
Goal:
Identify components encoded in the DNA sequence
Open Reading Frame
ATGCTCAGCGTGACCTCA . . . CAGCGTTAA
M  L  S  V  T  S   . . . Q  R  STP
• Protein-encoding DNA sequence consists of a
sequence of 3 letter codons
• Starts with the START codon (ATG)
• Ends with a STOP codon (TAA, TAG, or TGA)
Finding Open Reading Frames
ATGCTCAGCGTGACCTCA . . . CAGCGTTAA
M  L  S  V  T  S   . . . Q  R  STP
Try all possible starting points
• 3 possible offsets
• 2 possible strands
Simple algorithm finds all ORFs in a genome
• Many of these are spurious (are not real genes)
• How do we focus on the real ones?
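The simple algorithm mentioned above can be sketched as a scan over 3 offsets on each of 2 strands. This is one possible version, not necessarily the one used in the lecture: the function name find_orfs, the min_codons filter and the test sequence are assumptions made for illustration, and coordinates on the "-" strand refer to the reverse-complemented string.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(dna):
    return "".join(COMPLEMENT[base] for base in reversed(dna))

def find_orfs(dna, min_codons=2):
    """Return (strand, start, end, sequence) for every ATG...stop run."""
    orfs = []
    for strand_name, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            start = None  # index of the ATG that opened the current ORF
            for i in range(offset, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3  # the stop codon is included in the ORF
                    if (end - start) // 3 >= min_codons:
                        orfs.append((strand_name, start, end, strand[start:end]))
                    start = None
    return orfs

# Hypothetical test sequence: the slide's ORF with two flanking bases on each side.
sequence = "CC" + "ATGCTCAGCGTGACCTCACAGCGTTAA" + "GG"
for orf in find_orfs(sequence):
    print(orf)
```

An ATG that never reaches a stop codon before the sequence ends is not reported. As the slide notes, many ORFs found this way are spurious, so the output is only a list of candidates.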
Using Additional Genomes
Basic premise
“What is important is conserved”
Evolution = Variation + Selection
– Variation is random
– Selection reflects function
Idea:
• Instead of studying a single genome, compare
related genomes
• A real open reading frame will be conserved
Phylogenetic Tree of Yeasts
S. cerevisiae
~10M years
S. paradoxus
S. mikatae
S. bayanus
C. glabrata
S. castellii
K. lactis
A. gossypii
K. waltii
D. hansenii
C. albicans
Y. lipolytica
N. crassa
M. graminearum
M. grisea
A. nidulans
S. pombe
Kellis et al, Nature 2003
Evolution of Open Reading Frame
S. cerevisiae   ATGCTCAGCGTGACCTCA
S. paradoxus    ATGCTCAGCGTGACATCA
S. mikatae      ATGCTCAGGGTGACA--A
S. bayanus      ATGCTCAGG---ACA--A
[Figure labels: conserved positions; variable positions; a deletion; frame shift changes interpretation of downstream seq]
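Whether a deletion shifts the reading frame depends on its length: a gap whose length is a multiple of 3 removes whole codons, while any other length changes how the downstream sequence is read. The sketch below applies that rule to the aligned sequences from the slide. The function name classify_gaps is hypothetical, and each gap run is judged on its own, which is a simplification.

```python
import re

def classify_gaps(aligned):
    """Return (position, length, kind) for every run of '-' in an aligned sequence."""
    result = []
    for match in re.finditer(r"-+", aligned):
        length = len(match.group())
        kind = "in-frame" if length % 3 == 0 else "frame shift"
        result.append((match.start(), length, kind))
    return result

alignment = {
    "S. cerevisiae": "ATGCTCAGCGTGACCTCA",
    "S. paradoxus":  "ATGCTCAGCGTGACATCA",
    "S. mikatae":    "ATGCTCAGGGTGACA--A",
    "S. bayanus":    "ATGCTCAGG---ACA--A",
}
for species, seq in alignment.items():
    print(species, classify_gaps(seq))
```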
Examples
[Figure labels: conserved; variable; spurious ORF; frame shift; ATG not conserved; confirmed ORF; sequencing error]
Greedy algorithm to find conserved ORFs surprisingly effective (> 99% accuracy) on verified yeast data
[Kellis et al, Nature 2003]
Defining Conservation
Naïve approach
• Consensus between all species
Problem:
• Rough grained
• Ignores distances between species
• Ignores the tree topology
Goal:
• More sensitive and robust methods
[Figure: example columns: conserved (A A A A), variable (A C A G), then A A A A and C C C A]
A T C C
A C C A
A G C A
A G C A
A T C C
% conserv 100 33 55 55
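A naïve per-column consensus score can be sketched as follows. The definition used here (share of sequences that carry the most common symbol in a column, as a whole percentage) is an assumption chosen for illustration and may differ from how the slide's "% conserv" row was computed. Gaps are counted as an ordinary symbol.

```python
from collections import Counter

def column_conservation(alignment):
    """Percentage of rows that agree with the most common symbol, per column."""
    scores = []
    for column in zip(*alignment):
        top_count = Counter(column).most_common(1)[0][1]
        scores.append(100 * top_count // len(column))
    return scores

# The four aligned yeast sequences from the earlier slide.
yeast_alignment = [
    "ATGCTCAGCGTGACCTCA",
    "ATGCTCAGCGTGACATCA",
    "ATGCTCAGGGTGACA--A",
    "ATGCTCAGG---ACA--A",
]
print(column_conservation(yeast_alignment))
```

This score treats every species equally, which is exactly the weakness listed on the slide: it ignores the distances between species and the tree topology.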
Bioinformatics – an area of emerging knowledge
• Each cell of the body contains the whole DNA of the
individual (about 40,000 genes in the human genome,
each of them comprising from 50 to a million base pairs
– A, T, C or G)
• The Central Dogma in Genetics: DNA->RNA->proteins
• Transcription: DNA (about 5%) -> mRNA
– DNA -> pre-RNA -> splicing -> mRNA (only the exons)
• Translation: mRNA -> proteins
– Proteins make cells alive and specialised (e.g. blue eyes)
– Genome -> proteome
N.Kasabov, 2003
Bioinformatics
• The area of Science that is concerned with the development and
applications of methods, tools and systems for storing and processing
of biological information to facilitate knowledge discovery.
• Interdisciplinary: Information and computer science, Molecular
Biology, Biochemistry, Genetics, Physics, Chemistry, Health and
Medicine, Mathematics and Statistics, Engineering, Social Sciences.
• Biology, Medicine -- Information Science --> IT, Clinics, Pharmacy
• Links to Health informatics, Clinical DSS, Pharmaceutical Industry
N.Kasabov, 2003
Bioinformatics: challenging problems for
computer and information sciences
• Discovering patterns (features) from DNA and RNA
sequences (e.g. genes, promoters, ribosome binding sites (RBS),
splice junctions)
• Analysis of gene expression data and predicting
protein abundance
• Discovery of gene networks – genes that are co-regulated over time
• Protein discovery and protein function analysis
• Predicting the development of an organism from its
DNA code (?)
• Modeling the full development (metabolic
processes) of a cell (?)
• Implications: health; social,…
N.Kasabov, 2003
Problems in Computational Modeling for Bioinformatics
• An abundance of genome data, RNA data, protein data and metabolic pathway data is now available
(see http://www.ncbi.nlm.nih.gov), and this is just the beginning of computational modeling in
Bioinformatics
• Complex interactions:
– between proteins, genes, DNA code,
– between the genome and the environment
– much yet to be discovered
• Stability and repetitiveness: Genes are relatively stable carriers of information.
• Many sources of uncertainty:
– Alternative splicing
– Mutation in genes caused by: ionising radiation (e.g. X-rays), chemical contamination,
replication errors, viruses that insert genes into host cells, aging processes, etc.
– Mutated genes express differently and cause the production of different proteins
• It is extremely difficult to model dynamic, evolving processes
N.Kasabov, 2003
Bioinformatics Important Challenges
[Figure labels: Transcription; Gene Prediction; Translation; Gene Function; Protein Function; Protein 3D Structure; Public Data Base]
[Figure: DNA sequence {A,T,C,G} -> Transcription -> gene expression level (microarray) -> Translation -> protein sequence (KMLSLLMARTYW)]
Gene Expression
Microarray
• What can it be used for?
• How does it work?
• What are the Advantages?
An Example Application
Microarrays can be used for:
Comparison of transcription levels between two cells
Examples:
Comparison between:
Cells from a young mouse vs cells from an old mouse
Drug efficacy:
Treated cells vs untreated cells
How it works:
Based on hybridization
[Figure: mRNA strands hybridizing to complementary DNA probes by base pairing (A=U, T=A, C≡G); for example, the probe ACTTGACC pairs with the mRNA UGAACUGG]
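Hybridization relies on base pairing between a DNA probe and the mRNA. The sketch below pairs the two strands position by position, as they are drawn in the figure, and ignores strand orientation. The function names are hypothetical.

```python
# Watson-Crick partner on the RNA side for each DNA probe base.
DNA_TO_RNA_PAIR = {"A": "U", "T": "A", "C": "G", "G": "C"}

def pairing_mrna(probe):
    """mRNA that pairs perfectly with the probe, base by base."""
    return "".join(DNA_TO_RNA_PAIR[base] for base in probe)

def matched_pairs(probe, mrna):
    """Number of positions where probe and mRNA form a valid base pair."""
    return sum(DNA_TO_RNA_PAIR[p] == m for p, m in zip(probe, mrna))

print(pairing_mrna("ACTTGACC"))               # perfect partner of the probe
print(matched_pairs("ACTTGACC", "UGAACUGG"))  # all positions pair
print(matched_pairs("ACTTGACC", "UGAAUUGG"))  # one mismatch
```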
Probes and the printing process
[Figure labels: print head with pins; microtiter plates; slides (100)]
Microarray Technology
23/2/2008
sample (labelled)
probe (on chip)
pseudo-colour image
[image from Jeremy Buhler]
Experimental design
• Track what’s on the chip
– which spot corresponds to which gene
• Duplicate experimental spots
– reproducibility
• Controls
– DNAs spotted on glass
  – positive probe (induced or repressed)
  – negative probe (bacterial genes on human chip)
– oligos on glass or synthesised on chip (Affymetrix)
  – point mutants (hybridisation plus/minus)
Images from scanner
• Resolution
– standard 10 µm [currently, max 5 µm]
– 100 µm spot on chip = 10 pixels in diameter
• Image format
– TIFF (tagged image file format) 16 bit (65,536 levels of grey)
– 1 cm x 1 cm image at 16 bit = 2 MB (uncompressed)
– other formats exist, e.g. SCN (used at Stanford University)
• Separate image for each fluorescent sample
– channel 1, channel 2, etc.
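The figures on this slide follow from simple arithmetic, which the sketch below reproduces. The function name scan_image_size is hypothetical, and the calculation assumes square pixels and no file header or compression.

```python
def scan_image_size(width_cm, height_cm, pixel_um, bits_per_pixel):
    """Return (pixels_x, pixels_y, size_in_bytes) for an uncompressed scan."""
    pixels_x = round(width_cm * 10_000 / pixel_um)   # 1 cm = 10,000 micrometres
    pixels_y = round(height_cm * 10_000 / pixel_um)
    size_bytes = pixels_x * pixels_y * bits_per_pixel // 8
    return pixels_x, pixels_y, size_bytes

print(scan_image_size(1, 1, 10, 16))   # 1 cm x 1 cm at 10 micrometres per pixel
print(100 // 10)                       # spot diameter in pixels
print(2 ** 16)                         # grey levels in a 16-bit image
```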
Images in analysis software
• The two 16-bit images (cy3, cy5) are compressed into 8-bit images
• Goal: display fluorescence intensities for both wavelengths using a
24-bit RGB overlay image
• RGB image:
– Blue values (B) are set to 0
– Red values (R) are used for cy5 intensities
– Green values (G) are used for cy3 intensities
• Qualitative representation of results
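The overlay rule can be sketched in pure Python. The slide does not say how the 16-bit values are compressed to 8 bits; keeping the top 8 bits (a right shift) is an assumption of this sketch, and analysis software may use a different scaling. The function names are hypothetical.

```python
def to_8bit(value16):
    """Compress a 16-bit intensity (0..65535) to 8 bits by keeping the top byte."""
    return value16 >> 8

def overlay_pixel(cy3, cy5):
    """Build one 24-bit RGB pixel: R from cy5, G from cy3, B fixed at 0."""
    return (to_8bit(cy5), to_8bit(cy3), 0)

def overlay_image(cy3_image, cy5_image):
    """Combine two equally sized intensity grids (lists of rows) into RGB rows."""
    return [
        [overlay_pixel(g, r) for g, r in zip(row3, row5)]
        for row3, row5 in zip(cy3_image, cy5_image)
    ]

cy3 = [[65535, 0], [65535, 1000]]
cy5 = [[65535, 65535], [0, 1000]]
print(overlay_image(cy3, cy5))
```

Equal strong signals in both channels give (255, 255, 0), which displays as yellow.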
Images : examples
[Figure panels: cy3, cy5, pseudo-color overlay]
Spot color   Signal strength       Gene expression
yellow       Control = perturbed   unchanged
red          Control < perturbed   induced
green        Control > perturbed   repressed
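The table can be turned into a small classification rule. In the sketch below the control sample is the green (cy3) channel and the perturbed sample is the red (cy5) channel, which follows from the table together with the RGB assignment on the previous slide. The two-fold threshold and the function name classify_spot are assumptions made for illustration.

```python
import math

def classify_spot(control, perturbed, fold=2.0):
    """Return (spot colour, expression call) from two positive intensities."""
    log_ratio = math.log2(perturbed / control)
    threshold = math.log2(fold)
    if log_ratio >= threshold:
        return ("red", "induced")        # control < perturbed
    if log_ratio <= -threshold:
        return ("green", "repressed")    # control > perturbed
    return ("yellow", "unchanged")       # control roughly equal to perturbed

print(classify_spot(100, 400))
print(classify_spot(400, 100))
print(classify_spot(100, 110))
```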
Data : DNA Microarray
[Figure: assay measurements for gene 1, gene 2 and gene 3 plotted against time (min), from 0 to 60]
Data Required: Gene Expression Matrix
      t1   t2   t3   t4
g1    0    1    2    1
g2    1    2    1    0
g3    0    1    1    1
g4    1    2    1    0
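A gene expression matrix maps naturally onto a table with one row per gene and one column per time point. The sketch below stores the matrix from the slide and compares gene profiles with the Pearson correlation. Using Pearson correlation as the similarity measure is an assumption of this sketch, not something the slide prescribes.

```python
from itertools import combinations
from math import sqrt

# Rows are genes, columns are the time points t1..t4 from the slide.
EXPRESSION = {
    "g1": [0, 1, 2, 1],
    "g2": [1, 2, 1, 0],
    "g3": [0, 1, 1, 1],
    "g4": [1, 2, 1, 0],
}

def pearson(x, y):
    """Pearson correlation of two equally long profiles with non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

for (name_a, row_a), (name_b, row_b) in combinations(EXPRESSION.items(), 2):
    print(name_a, name_b, round(pearson(row_a, row_b), 2))
```

Genes with identical profiles, such as g2 and g4 here, get a correlation of 1 and are candidates for co-regulation.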
Data Required: Gene Expression Matrix
Time series:
      t1   t2   t3   t4
g1    0    1    2    1
g2    1    2    1    0
g3    0    1    1    1
g4    1    2    1    0
Snap shot:
      a1   a2   a3   a4
g1    0    3    1    1
g2    1    2    1
g3    0    1
g4    1    2
• World Health Organization