Transcript Microarrays
Introduction to Bioinformatics
Juris Viksna, IMCS UL
2015
Alvis Brazma, European Bioinformatics Institute
Planned course schedule
Regular lectures:
Wednesdays, 14:30-16:00
Most likely the first 4 (or so) lectures will stick to this schedule, followed by a break in March and a restart in April, with replacement lectures scheduled sometime in April-May.
Will try my best to invite guest lecturers (most likely this will involve rescheduled lecture times), but this is subject to options that might (or might not) become available.
Course requirements
To obtain a credit for this course you must:
- submit a programming project (worth 50% of grade);
- take a (written) exam (worth 50% of grade).
Course web page: http://susurs.mii.lu.lv/juris/courses/bi2015.html
Bioinformatics
• Databases and tools to store and access biomolecular data
• Sequence algorithms – assembly from short fragments, alignment of similar sequences, analysis of properties
• Evolution and phylogenetics
• 3D structure analysis of biomolecules
• Machine learning and data mining application to genome and related information
• Biomolecular interaction analysis (e.g., protein interactions)
• Dynamic systems, modelling of biological systems and networks
• Analysis of noisy measurement data, statistical analysis
• Data management, databases, interfaces, web services
• Links with health records, biomedical informatics
Why might bioinformatics be important for you?
• This is a growing science involving an increasing number of computer professionals (e.g., the 1000-human genome project has just started)
• Links with medical and health informatics information systems – a growing and important market for software
• Latvian genome project and participation in European genotyping projects – software experts who understand the underlying problems are needed
Topics covered in this course:
• Introduction to biology as an information science
• Overview of some bioinformatics problems
• Biosequence and structure analysis, molecular evolution and phylogeny etc.
• Genomics – DNA assembly, haplotypes etc.
• Gene regulation network modelling (graph theory, Boolean networks, dynamic systems)
• Analysis of gene expression data, cluster analysis, data mining and analysis
• Data management and analysis for biomedical studies
• Some new recently evolving topics (time and material availability permitting...)
FYI – Bioinformatics course in UL Faculty of Biology (by Nils Rostoks)
No.  Date        Lecture topic
1.   13.09.2013  Introductory lecture. Course requirements and literature sources. The concept of bioinformatics. What is bioinformatics and why do biologists need it? Biology, statistics, information technology and programming as the basic elements of bioinformatics
2.   20.09.2013  Types and volume of biological information. Genome organization. Modern genome analysis methods
3.   27.09.2013  Genome evolution. Comparative genomics
4.   04.10.2013  Biological information databases. Information search and retrieval systems
5.   11.10.2013  Examples of using various biological information databases
6.   18.10.2013  Basic principles of nucleic acid and protein sequence similarity. Pairwise comparison of nucleic acid and protein sequences. Types of BLAST
7.   25.10.2013  Methods for multiple comparison of nucleic acids and proteins, their advantages and conditions of application. Computer programs for multiple comparison of nucleic acid and protein sequences
8.   01.11.2013  Phylogenetics. Cluster and cladistic methods in phylogenetic tree reconstruction
9.   08.11.2013  Seminar and assessment of exercises on topics related to information search in databases and sequence homology search
10.  15.11.2013  Computer programs for phylogenetic analysis of nucleic acid and protein sequences
11.  22.11.2013  No class
12.  29.11.2013  Spatial structure of macromolecules and its prediction. DNA topology. Protein structure prediction, modelling and application in pharmacology
13.  13.09.2013  Genome expression analysis. Transcriptomics. DNA chips in genome polymorphism analysis. Genetics of gene expression. Proteomics and systems biology. Network structures as a natural component of biological systems
14.  13.12.2013  Seminar and assessment of exercises on topics related to phylogenetic analysis and protein secondary structure prediction. Perspectives of bioinformatics. Bioinformatics as ...
FYI – Bioinformatics course in UL Faculty of Biology (by Nils Rostoks)
• Biological information: its diversity and volume
• Biology, statistics, information technology and programming as the basic elements of bioinformatics
• Genome organization and evolution
• Comparative genomics
• Biological information databases. Information search and retrieval systems
• Basic principles of nucleic acid and protein sequence similarity. Different comparison methods, their advantages and conditions of application
• Phylogenetics. Cluster and cladistic methods in phylogenetic tree reconstruction
• Genome expression analysis
• DNA chips in genome polymorphism analysis. Genetics of gene expression
• DNA topology, protein structure, methods for its prediction and application in pharmacology
• Proteomics and systems biology. Network structures as a natural component of biological systems
• Perspectives of bioinformatics. Bioinformatics as a prerequisite for mastering modern biology
NIH WORKING DEFINITION OF BIOINFORMATICS AND COMPUTATIONAL BIOLOGY July 17, 2000
Bioinformatics:
Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioural or health data, including those to acquire, store, organize, archive, analyse, or visualize such data.
Computational Biology:
The development and application of data-analytical and theoretical methods, mathematical modelling and computational simulation techniques to the study of biological, behavioural, and social systems.
Human Genome Project
• Began in 1990 in the US
• The primary goal – to sequence the human DNA, which is about 3 billion base pairs long
• A working draft of the genome was released in 2000
• Finished in 2003, with further analysis still being published
The results of HGP
• A sequence about 3 billion letters long, written with four letters (A, T, G and C), that contains all the inherited human information
• Genomes of many other organisms
• Development of biotechnology that allows not only sequencing DNA but also studying the function of different biomolecules, and that produces many TB of data
• Databases storing this information (GenBank and EMBL data library)
• Data analysis and management needs leading to the emergence and development of bioinformatics
Things have however recently changed again – with NGS technologies, sequencing of a specific individual has become affordable – with direct implications for the amount of data that needs to be stored and/or analyzed.
All you need to know about Molecular Biology
One of the first textbooks in bioinformatics
MIT press 2000
A few other textbooks for «Computer Scientists»
MIT press 2004 Chapman and Hall/CRC 1995
Some bioinformatics problems from the perspective of Computer Science
Genome sequencing and assembly
E.Green (2001)
Strategies for the systematic sequencing of complex genomes.
Nat Rev Genetics, Vol 2:8, 573-583.
Ensembl genome browser
Genome sequence assembly
Sequence assembly problem
Ok, let us assume that we have these hybridizations.
How can we reconstruct the initial DNA sequence from them?
Affymetrix GeneChip
W.Bains, C.Smith (1988)
A novel method for nucleic acid sequence determination.
Journal of Theoretical Biology, Vol. 135:3, 303-307.
SBH – Hamiltonian path approach
Hamiltonian path (cycle) problem
For a given graph find a path (cycle) that visits every vertex exactly once (or show that such a path does not exist).
NP-hard. That means that no polynomial time algorithm is known, and the known exact algorithms stop working in realistic time already for comparatively small graphs.
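To make the Hamiltonian path view of SBH concrete, here is a minimal sketch. It assumes an idealised, error-free spectrum in which every l-mer occurs exactly once; the function name `sbh_hamiltonian` and the example spectrum are illustrative only. Each l-mer is a vertex, and two l-mers are joined when the suffix of one equals the prefix of the other. The brute-force search over all orderings shows why this approach does not scale.

```python
from itertools import permutations

def sbh_hamiltonian(spectrum):
    """Reconstruct a sequence by trying every ordering of the l-mers.

    An ordering is a Hamiltonian path in the overlap graph when each
    l-mer overlaps the next one in all but one letter.
    Runs in O(n!) time, so it is usable only for tiny spectra.
    """
    for order in permutations(spectrum):
        if all(a[1:] == b[:-1] for a, b in zip(order, order[1:])):
            # Glue the path together: first l-mer plus last letter of the rest.
            return order[0] + "".join(s[-1] for s in order[1:])
    return None  # no Hamiltonian path exists

print(sbh_hamiltonian(["TGC", "ATG", "CGT", "GCG"]))
```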
SBH – Eulerian path approach
Eulerian path (cycle) problem
For a given graph find a path (cycle) that visits every edge exactly once (or show that such a path does not exist).
In a connected undirected graph an Eulerian cycle exists if and only if each of the graph vertices has even degree. Moreover, there is a simple linear time algorithm for finding an Eulerian cycle.
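One linear time algorithm of this kind is Hierholzer's algorithm. The sketch below is an illustration for undirected graphs given as an edge list; the function name and the input format are assumptions made for this example, not part of the slides.

```python
def eulerian_cycle(edges):
    """Return an Eulerian cycle as a list of vertices, or None.

    edges: list of (u, v) pairs of an undirected (multi)graph.
    Uses Hierholzer's algorithm; every edge is examined a constant
    number of times, so the running time is linear in len(edges).
    """
    if not edges:
        return []
    adj = {}
    for idx, (u, v) in enumerate(edges):
        adj.setdefault(u, []).append((v, idx))
        adj.setdefault(v, []).append((u, idx))
    # Necessary condition: every vertex has even degree.
    if any(len(nbrs) % 2 for nbrs in adj.values()):
        return None
    used = [False] * len(edges)
    stack, cycle = [edges[0][0]], []
    while stack:
        u = stack[-1]
        # Discard edges that were already walked from the other end.
        while adj[u] and used[adj[u][-1][1]]:
            adj[u].pop()
        if adj[u]:
            v, idx = adj[u].pop()
            used[idx] = True
            stack.append(v)
        else:
            cycle.append(stack.pop())
    # If some edges were never reached, the graph is not connected.
    return cycle if len(cycle) == len(edges) + 1 else None
```

For example, `eulerian_cycle([(1, 2), (2, 3), (3, 1)])` returns a closed walk over the triangle, while a single edge `[(1, 2)]` gives `None` because both endpoints have odd degree.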
Next Generation Sequencing (Illumina)
In the case of de novo sequencing we have essentially the same fragment assembly problem as for SBH, only the number of DNA sequence fragments is much higher and their size is larger (~50-150 bp).
Sequence mappers
Sequence assembly – deBruijn graphs
D.Zerbino, E.Birney (2008)
Velvet: algorithms for de novo short read assembly using de Bruijn graphs.
Genome Research, Vol. 18:5, 821-829.
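The idea behind de Bruijn graph assembly can be sketched in a few lines. This is a toy illustration, not Velvet's actual implementation: it assumes error-free reads from one strand, ignores k-mer multiplicities, and assumes that the graph has an Eulerian path. The vertices are (k-1)-mers and every distinct k-mer found in the reads is a directed edge; the names `de_bruijn_assemble`, `reads` and `k` are chosen for this example.

```python
from collections import defaultdict

def de_bruijn_assemble(reads, k):
    """Spell a sequence along an Eulerian path of the de Bruijn graph."""
    kmers = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    graph = defaultdict(list)    # (k-1)-mer -> list of successor (k-1)-mers
    balance = defaultdict(int)   # out-degree minus in-degree
    for kmer in sorted(kmers):
        u, v = kmer[:-1], kmer[1:]
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # An Eulerian path starts where out-degree exceeds in-degree by one;
    # if there is no such vertex, any vertex with outgoing edges will do.
    start = next((n for n, b in balance.items() if b == 1), next(iter(graph)))
    stack, path = [start], []
    while stack:                 # Hierholzer's algorithm, directed version
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])

print(de_bruijn_assemble(["ATGCG", "GCGTA", "CGTAC"], k=3))
```

Real assemblers additionally have to handle sequencing errors, repeats and both DNA strands, which is where most of the engineering effort goes.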
All you need to know about Molecular Biology
Central dogma of molecular biology
DNA --(transcription)--> RNA --(translation)--> Protein
DNA
Four different nucleotides: adenine, guanine, cytosine and thymine. They are usually referred to as bases and denoted by their initial letters, A, C, G and T.
5' C-G-A-T-T-G-C-A-A-C-G-A-T-G-C 3'
   | | | | | | | | | | | | | | |
3' G-C-T-A-A-C-G-T-T-G-C-T-A-C-G 5'
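The base pairing rule (A with T, C with G) means that one strand fully determines the other. A minimal sketch, with function names chosen for this example:

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand):
    """Base-by-base complement, read in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in strand)

def reverse_complement(strand):
    """The opposite strand written in its own 5' to 3' direction."""
    return complement(strand)[::-1]

top = "CGATTGCAACGATGC"            # the 5' -> 3' strand from the slide
print(complement(top))             # the lower strand as drawn, 3' -> 5'
print(reverse_complement(top))     # the lower strand read 5' -> 3'
```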
DNA
DNA - Biology as an information science
5' C-G-A-T-T-G-C-A-A-C-G-A-T-G-C 3'
   | | | | | | | | | | | | | | |
3' G-C-T-A-A-C-G-T-T-G-C-T-A-C-G 5'
Thus, for many information related purposes, the molecule can be represented as
CGATTGCAACGATGC
The maximal amount of information that can be encoded in such a molecule is therefore 2 bits times the length of the sequence. Noting that the distance between nucleotide pairs in DNA is about 0.34 nm, we can calculate that the linear information storage density in DNA is about 6x10^7 bits/cm, which is approximately 7.5 MB, or about 1% of a CD-ROM, per cm.
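The arithmetic behind this estimate is easy to check. The sketch below uses only the two figures given on the slide (2 bits per base and 0.34 nm between neighbouring base pairs); the constant and function names are chosen for this example. With these inputs the density comes out at roughly 5.9x10^7 bits, or about 7 MB, per cm.

```python
NM_PER_CM = 1e7        # 1 cm = 10^7 nm
RISE_NM = 0.34         # distance between neighbouring base pairs
BITS_PER_BASE = 2      # four letters -> log2(4) = 2 bits

def bits_per_cm():
    """Maximal number of bits stored along 1 cm of DNA."""
    return BITS_PER_BASE * NM_PER_CM / RISE_NM

def sequence_megabytes(n_bases):
    """Maximal information content of a sequence, in MB (10^6 bytes)."""
    return n_bases * BITS_PER_BASE / 8 / 1e6

print(f"{bits_per_cm():.2e} bits per cm")
print(f"{bits_per_cm() / 8 / 1e6:.1f} MB per cm")
print(f"{sequence_megabytes(3_000_000_000):.0f} MB for 3 billion bases")
```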
DNA replication – copying the information
Polymerase chain reaction – PCR – Xeroxing the DNA
Genome sequencing
• Reading the nucleotides in the DNA molecule and storing the readout in a computer
• Basic technology ideas
  – A version of PCR
  – Separation of molecules by chemical properties such as weight or length of the DNA
  – Molecule labelling and fluorescent labelling in particular
  – DNA fragmentation in random length bits
Anatomy of a chromosome
• Centromeres are the largest constriction of the chromosome
• Site of attachment of spindle fibers
• 100,000s of 171 base pair repeats, called alpha satellite sequences
• Centromere associated proteins are bound
[Adapted from R.Yasbin]
Genomes, chromosomes
A genome is a set of DNA molecules, one (long) DNA molecule per chromosome.
The 23 human chromosomes
Organism   Number of chromosomes   Genome size in base pairs
Bacteria   1                       ~400,000 - ~10,000,000
Yeast      12                      14,000,000
Worm       6                       100,000,000
Fly        4                       300,000,000
Weed       5                       125,000,000
Human      23                      3,000,000,000
Genome sizes
Organism   Number of chromosomes   Genome size in base pairs   Year sequenced
Bacteria   1                       ~400,000 - ~10,000,000      1995
Yeast      12                      14,000,000                  1997
Worm       6                       100,000,000                 1999
Fly        4                       300,000,000                 2000
Weed       5                       125,000,000                 2001
Human      23                      3,000,000,000               2003
Information in the human genome – up to 0.75 GB (3 billion bases at 2 bits each)
www.ensembl.org
Genomes and genes
[Figure: structure of a gene on the DNA. Labels: control statement, TATA box, start, ribosome binding, 5' UTR, gene, termination (stop), 3' UTR, control statement. Transcription (RNA polymerase) produces mRNA; translation (ribosome) produces protein.]
Chromosomes - Eukaryotes
Chromosomes - Prokaryotes
(Eukaryotic) cell
[Adapted from Online Biology Book]
(Prokaryotic) cell
Viruses
Operons
[Adapted from R.Shamir]
Exons and introns
Gene regulation
[Adapted from http://www.gennetworks.co.uk/products/trrd.shtml]
RNA
• Like DNA, RNA consists of 4 nucleotides, but instead of thymine (T) it has the alternative uracil (U)
• RNA is similar to DNA, but its chemical properties are such that it keeps itself single stranded
• RNA is complementary to a single stranded DNA
5' C-G-A-T-T-G-C-A-A-C-G-A-T-G-C 3' DNA
   | | | | | | | | | | | | | | |
3' G-C-U-A-A-C-G-U-U-G-C-U-A-C-G 5' RNA
RNA structure
RNA sequence: ...AGGCUAUGGCCA...
Single-stranded, but
A tends to pair with U
G tends to pair with C
Splicing, translation, proteins
Ribosomes – read in RNA and use single amino acids to output proteins
Proteins
• Chains built from 20 different amino acids
• Have complex 3D shapes
• They are the main building blocks of a cell
  – 10% of the cellular mass is proteins, almost all the rest is water
  – Structural proteins
  – Functional proteins
    • Enzymes – biological catalysts
Protein structure and their function
Proteins are chains of 20 different types of amino acids, and they have complex structures determined by their sequence. The structures in turn determine their functions.
Discovering the genetic code
• Was done by feeding all 64 different nucleotide triplets to the protein synthesis machinery (which was provided with all 20 amino acids)
• This made it possible to map triplets to amino acids
• It turned out that the code is the same in all organisms – the genetic code is universal!
Genetic code
Completely worked out by 1966
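As an information processing step, translation is a table lookup from 64 triplets to 20 amino acids plus a stop signal. The sketch below builds the standard codon table from the conventional compact encoding (bases in the order T, C, A, G); the function name and the example sequences are illustrative. Reading frame detection, splicing and alternative codes (e.g. mitochondrial) are deliberately left out.

```python
BASES = "TCAG"
# Amino acids for TTT, TTC, TTA, TTG, TCT, ... GGG; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(sequence):
    """Translate a coding DNA or mRNA sequence, starting at position 0."""
    dna = sequence.upper().replace("U", "T")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

print(translate("ATGGCCTAA"))
```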
Why do we need to compare sequences?
The genome is already sequenced (assume...)
There are methods that predict DNA coding regions (genes)
What are the biological functions of these genes??
We can find out what protein (sequence) a gene encodes
But we still do not know what this protein does...
However, we can search for proteins with similar sequences whose functions are already known
We want to find out something about proteins in humans
The best approach is “experimental”, but tricky with humans...
But we can try to use a similar protein (e.g. in mice) and start our experiments with it
Evolution of sequences
Mutations are a natural process of DNA evolution
DNA replication errors:
  substitutions
  insertions  }
  deletions   } indels
Similarity between sequences:
  indicates their common ancestral origin
  indicates similarity of biological functions
Well, this is of course a simplification: the change of protein function will determine whether the organism will have offspring and whether the changed gene will survive
Protein sequence similarity is closely associated with similarity of DNA coding regions
How to compare sequences?
Given two proteins:
>sp|P69905|HBA_HUMAN Hemoglobin
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPT
TKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDM
PNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLP
AEFTPAVHASLDKFLASVSTVLTSKYR
>tr|Q61287|Q61287_MOUSE Hemoglobin
MVLSGEDKSNIKAAWGKIGGHGAEYVAEALERMFASFP
TTKTYFPHFDVSHGSAQVKGHGKKVADALASAAGHLDD
LPGALSALSDLHAHKLRVDPVNFKLLSHCLLVTLASHH
PADFTPAVHASLDKFLASVSTVLTSKYR
How to decide how similar these are?
Edit distance as similarity measure
[Figure: dynamic programming table of edit distances between the prefixes of two example strings]
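A table of this kind is filled in row by row with dynamic programming. The sketch below computes the classical edit (Levenshtein) distance with unit costs for substitution, insertion and deletion; the unit costs and the example words are assumptions made for this illustration.

```python
def edit_distance(a, b):
    """Minimal number of substitutions, insertions and deletions turning a into b.

    Only two rows of the table are kept: prev holds the distances for the
    previous prefix of a, cur is the row being filled.
    """
    prev = list(range(len(b) + 1))         # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]                          # distance to the empty prefix of b
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = cur
    return prev[-1]

print(edit_distance("kitten", "sitting"))
```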
Substitution matrices
Similarity matrix
The most popular: PAM, BLOSUM, Gonnet
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C  0 -3 -3 -3  8 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H -2  0  1 -1 -3  0  0 -2  7 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  6 -1 -1 -4 -3 -2
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 10  2 -3
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  6 -1
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
"Traditional assumption": the substitution score is > 0 for substitutions that are more frequent than random ones, and < 0 for substitutions that are less frequent than random ones.
[Adapted from M.Gerstein]
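This sign convention follows from the usual log-odds construction of substitution scores: the frequency with which two residues are observed aligned is compared with the frequency expected by chance. A minimal sketch follows; the function name, the scaling factor and the frequencies in the example are hypothetical illustration values, not taken from any published matrix.

```python
import math

def log_odds_score(p_ab, q_a, q_b, scale=2.0):
    """Rounded log-odds score for aligning residues a and b.

    p_ab  - observed frequency of the aligned pair (a, b)
    q_a, q_b - background frequencies of a and b
    scale - 2.0 gives half-bit units (an assumption for this example)
    """
    return round(scale * math.log2(p_ab / (q_a * q_b)))

# Pair seen twice as often as expected by chance -> positive score.
print(log_odds_score(0.02, 0.1, 0.1))
# Pair seen half as often as expected by chance -> negative score.
print(log_odds_score(0.005, 0.1, 0.1))
```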
Global and local alignments
Gap penalties
[Adapted from M.Craven]
Global alignment with gap penalties
[Adapted from M.Craven]
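A minimal sketch of global alignment scoring (the Needleman-Wunsch recurrence) with a linear gap penalty is given below. The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions for this example; in practice a substitution matrix and often an affine gap penalty would be used instead.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    s = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):           # prefix of a aligned against gaps only
        s[i][0] = i * gap
    for j in range(1, cols):           # prefix of b aligned against gaps only
        s[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = s[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            s[i][j] = max(diagonal,            # align a[i-1] with b[j-1]
                          s[i - 1][j] + gap,   # a[i-1] against a gap
                          s[i][j - 1] + gap)   # b[j-1] against a gap
    return s[-1][-1]

print(global_alignment_score("GATTACA", "GATACA"))
```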
Protein sequence databases and comparison tools
>sp|P00156|CYB_HUMAN Cytochrome b - Homo sapiens (Human).
MTPMRKINPLMKLINHSFIDLPTPSNISAWWNFGSLLGACLILQITTGLFLAMHYSPDAS
TAFSSIAHITRDVNYGWIIRYLHANGASMFFICLFLHIGRGLYYGSFLYSETWNIGIILL
LATMATAFMGYVLPWGQMSFWGATVITNLLSAIPYIGTDLVQWIWGGYSVDSPTLTRFFT
FHFILPFIIAALATLHLLFLHETGSNNPLGITSHSDKITFHPYYTIKDALGLLLFLLSLM
TLTLFSPDLLGDPDNYTLANPLNTPPHIKPEWYFLFAYTILRSVPNKLGGVLALLLSILI
LAMIPILHMSKQQSMMFRPLSQSLYWLLAADLLILTWIGGQPVSYPFTIIGQVASVLYFT
TILILMPTISLIENKMLKWA
Assessment of the results - P-Values
• P(s > S) = .01 – a P-value of .01 occurs at the score threshold S (392 below) at which the score s from a random comparison is greater than this threshold 1% of the time
• Likewise for P = .001 and so on.
[Adapted from M.Gerstein]
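One simple way to estimate such a P-value is empirical: score many comparisons between unrelated (or shuffled) sequences and count how often the random score exceeds the threshold. The sketch below assumes that such a list of random scores is already available; the function name and the toy numbers are illustrative. Real search tools typically fit an extreme value distribution instead of counting directly.

```python
def empirical_p_value(random_scores, threshold):
    """Fraction of random comparison scores strictly greater than threshold."""
    return sum(score > threshold for score in random_scores) / len(random_scores)

# Toy data: 100 random scores 0..99; only one of them exceeds 98.
print(empirical_p_value(list(range(100)), 98))
```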
ROC (Receiver Operating Characteristic) curves
[Figure: ROC curves of two “methods” (red is more effective), with points for different score thresholds (Thresh=30, Thresh=20, Thresh=10)]
Axis: coverage, up to 100% (roughly, fraction of sequences that one confidently “says something” about) [sensitivity = tp/n = tp/(tp+fn)]
Axis: error rate, up to 100% (fraction of the “statements” that are false positives) [specificity = tn/n = tn/(tn+fp); error rate = 1 - specificity = fp/n]
[Adapted from M.Gerstein]
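The points of a ROC curve follow directly from these formulas: for every score threshold, count the true and false positives and negatives. A minimal sketch, assuming that a score at or above the threshold counts as a positive prediction and that both classes are present in the data (the function name and the toy data are hypothetical):

```python
def roc_points(scores, labels, thresholds):
    """Return (threshold, sensitivity, error rate) for each threshold.

    scores - similarity scores of the compared pairs
    labels - True if the pair is truly related, False otherwise
    """
    points = []
    for t in thresholds:
        tp = fp = tn = fn = 0
        for score, positive in zip(scores, labels):
            predicted = score >= t
            if predicted and positive:
                tp += 1
            elif predicted:
                fp += 1
            elif positive:
                fn += 1
            else:
                tn += 1
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        points.append((t, sensitivity, 1 - specificity))
    return points

for point in roc_points([35, 25, 15, 5], [True, True, False, False], [30, 20, 10]):
    print(point)
```

Lowering the threshold moves along the curve towards higher coverage at the price of a higher error rate.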
MSA - Example
Multiple sequence alignment of 7 neuroglobins using ClustalX [Adapted from C.Struble]
MSA with dynamic programming approach
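For three sequences whose current prefixes end in V, S and A (N denotes the rest of each prefix, "-" a gap, and score(x, y, z) the score of one alignment column):
s(NV, NS, NA) = max {
    s(N,  N,  N ) + score(V, S, A),
    s(N,  N,  NA) + score(V, S, -),
    s(N,  NS, N ) + score(V, -, A),
    s(NV, N,  N ) + score(-, S, A),
    s(N,  NS, NA) + score(V, -, -),
    s(NV, N,  NA) + score(-, S, -),
    s(NV, NS, N ) + score(-, -, A)
}
[Adapted from G.Church]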
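The seven-way recurrence for three sequences can be written down directly. The sketch below assumes a sum-of-pairs column score (match +1, mismatch -1, residue against gap -1, gap against gap 0); these values and the function names are illustrative assumptions. The table has one cell per triple of prefix lengths, so time and memory grow as the product of the sequence lengths, which is why exact dynamic programming is practical only for very few short sequences.

```python
from itertools import product

def column_score(x, y, z, match=1, mismatch=-1, gap=-1):
    """Sum-of-pairs score of one alignment column ('-' denotes a gap)."""
    def pair(a, b):
        if a == "-" and b == "-":
            return 0
        if a == "-" or b == "-":
            return gap
        return match if a == b else mismatch
    return pair(x, y) + pair(x, z) + pair(y, z)

def msa3_score(a, b, c):
    """Score of the best alignment of three sequences."""
    # The 7 possible columns: each sequence contributes a residue (1) or a gap (0).
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    s = {(0, 0, 0): 0}
    for i in range(len(a) + 1):
        for j in range(len(b) + 1):
            for k in range(len(c) + 1):
                if (i, j, k) == (0, 0, 0):
                    continue
                best = float("-inf")
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue
                    col = (a[pi] if di else "-",
                           b[pj] if dj else "-",
                           c[pk] if dk else "-")
                    best = max(best, s[(pi, pj, pk)] + column_score(*col))
                s[(i, j, k)] = best
    return s[(len(a), len(b), len(c))]

print(msa3_score("NV", "NS", "NA"))
```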
Protein sequence: ...VPEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANK...
[Adapted from R.Shamir]
Protein structures - representations
3D coordinates
"Folds"
Protein folds - horseshoe ( - )
Problems related with protein structures
Determination (not really a bioinformatics problem...)
Prediction (the protein folding problem; one of the Holy Grails of bioinformatics...)
Comparison (not trivial, but there are methods that work reasonably well in practice)
Representations
Surface modelling
Modelling and prediction of protein interactions
Visualisation
Haeckel's "Tree of life"
"Higher" organisms
"Lower" organisms
A phylogenetic tree is a hierarchical, graphical representation of relationships
[Adapted from M.Thomas]
Phylogenetics
Phylogenetic trees
Hierarchic clustering and dendrograms
Types of phylogenetic trees
"Molecular clock"
Methods for phylogeny construction
• from distance matrices
• from character matrices
Other phylogeny related problems (comparison, merging etc)
Programs for construction and visualisation of phylogenetic trees
[Adapted from E.Willasen]
Phylogenetic trees
Distance matrices and property matrices [Adapted from M.Thomas]
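As an illustration of building a tree from a distance matrix, here is a minimal sketch of UPGMA, one classical hierarchical clustering method (chosen for this example; the slides do not say which method is meant). It repeatedly merges the two closest clusters and returns only the tree topology as nested tuples; branch lengths are omitted. The function name and the toy matrix are assumptions.

```python
def upgma(names, matrix):
    """Cluster taxa with UPGMA.

    names  - list of taxon labels
    matrix - symmetric list of lists of pairwise distances
    Returns the tree as nested tuples, e.g. ('C', ('A', 'B')).
    """
    trees = list(names)
    sizes = [1] * len(names)
    d = [row[:] for row in matrix]
    while len(trees) > 1:
        n = len(trees)
        # Closest pair of current clusters.
        i, j = min(((p, q) for p in range(n) for q in range(p + 1, n)),
                   key=lambda pq: d[pq[0]][pq[1]])
        # Distance from the merged cluster to every cluster: size-weighted average.
        merged = [(d[i][k] * sizes[i] + d[j][k] * sizes[j]) / (sizes[i] + sizes[j])
                  for k in range(n)]
        keep = [k for k in range(n) if k not in (i, j)]
        d = [[d[a][b] for b in keep] + [merged[a]] for a in keep]
        d.append([merged[k] for k in keep] + [0.0])
        trees = [trees[k] for k in keep] + [(trees[i], trees[j])]
        sizes = [sizes[k] for k in keep] + [sizes[i] + sizes[j]]
    return trees[0]

print(upgma(["A", "B", "C"], [[0, 2, 6], [2, 0, 6], [6, 6, 0]]))
```

UPGMA implicitly assumes a "molecular clock" (equal rates of change on all branches); methods such as neighbour joining relax this assumption.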
Using Phylogeny to Understand Gene Duplication and Loss
A. A gene tree.
B. The gene tree superimposed on a species tree, allowing identification of the duplication and loss events.
[Adapted from M.Thomas]
Genes are regulated (switched on or off)
Gene regulation networks – outrageously simplified
[Figure: DNA with genes GENE 1, GENE 2, GENE 3 and GENE 4; labels: promoter, coding DNA, specific proteins called transcription factors; the corresponding network with nodes G1, G2, G3, G4]
Types of models for biomolecular networks
What are GRN models useful for?
1. Simulation. For given initial conditions compute how the system evolves with time.
2. Model checking. Does the model behave according to the specification it was constructed from?
3. Reconstruction from data. From a given set of data from biological measurements construct a model consistent with the data.
4. Constraints on parameters. Find the requirements on parameters that must be satisfied to ensure a given “stable behaviour”.
5. System dynamics. Find all the possible “stable behaviours” that the biological system might exhibit.
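Tasks 1 and 5 can be illustrated with the simplest kind of model, a synchronous Boolean network. In the sketch below every gene is either on (1) or off (0) and all genes are updated simultaneously. The three-gene network and its update rules are entirely hypothetical and were made up for this example; "stable behaviours" are taken to be the attractors (fixed points or cycles) of the dynamics.

```python
from itertools import product

# Hypothetical rules: G1 is repressed by G3, G2 is activated by G1,
# G3 needs both G1 and G2.
RULES = [
    lambda s: int(not s[2]),   # G1
    lambda s: s[0],            # G2
    lambda s: s[0] & s[1],     # G3
]

def step(state, rules):
    """Synchronous update: all genes change at the same time."""
    return tuple(rule(state) for rule in rules)

def simulate(state, rules, max_steps=100):
    """Follow the dynamics until a state repeats.

    Returns (trajectory, attractor); attractor is the repeating part of
    the trajectory, or None if no repeat was seen within max_steps.
    """
    seen, trajectory = {}, []
    while state not in seen and len(trajectory) < max_steps:
        seen[state] = len(trajectory)
        trajectory.append(state)
        state = step(state, rules)
    attractor = trajectory[seen[state]:] if state in seen else None
    return trajectory, attractor

def attractors(rules):
    """All attractors, found by simulating from every possible initial state."""
    found = set()
    for state in product((0, 1), repeat=len(rules)):
        _, cycle = simulate(state, rules)
        if cycle is not None:
            found.add(frozenset(cycle))
    return found

trajectory, attractor = simulate((1, 0, 0), RULES)
print(trajectory)
print(len(attractors(RULES)), "attractor(s)")
```

Exhaustive enumeration of initial states is feasible only for small networks, since the number of states doubles with every added gene.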