Transcript Microarrays

Introduction to Bioinformatics

Juris Viksna, IMCS UL

2015

Alvis Brazma, European Bioinformatics Institute

Planned course schedule

Regular lectures:

Wednesdays, 14:30-16:00

Most likely the first 4 (or so) will stick to this schedule, followed by a break in March, and then restarting in April, with replacement lectures scheduled sometime in April-May.

Will try my best to invite guest lecturers (most likely this will involve rescheduled lecture times), but this is subject to options that might (or might not) become available.

Course requirements

To obtain a credit for this course you must:
- submit a programming project (worth 50% of grade);
- take a (written) exam (worth 50% of grade).

Course web page: http://susurs.mii.lu.lv/juris/courses/bi2015.html

Bioinformatics

• Databases and tools to store and access biomolecular data
• Sequence algorithms – assembly from short fragments, alignment of similar sequences, analysis of properties
• Evolution and phylogenetics
• 3D structure analysis of biomolecules
• Machine learning and data mining application to genome and related information
• Biomolecular interaction analysis (e.g., protein interactions)
• Dynamic systems, modelling of biological networks and systems
• Analysis of noisy measurement data, statistical analysis
• Data management, databases, interfaces, web services
• Links with health records, biomedical informatics

Why might Bioinformatics be important for you?

• This is a growing science involving an increasing number of computer professionals (e.g., the 1000-human genome project has just started)
• Links with medical and health informatics information systems – a growing and important market for software
• Latvian genome project and participation in European genotyping projects – software experts who understand the underlying problems are needed

Topics covered in this course:

• Introduction into biology as information science
• Overview of some bioinformatics problems
• Biosequence and structure analysis, molecular evolution and phylogeny etc
• Genomics – DNA assembly, haplotypes etc
• Gene regulation network modelling (graph theory, Boolean networks, dynamic systems)
• Analysis of gene expression data, cluster analysis, data mining and analysis
• Data management and analysis for biomedical studies
• Some new recently evolving topics (time and material availability permitting...)

FYI – Bioinformatics course in UL Faculty of Biology (by Nils Rostoks)

No.  Date        Lecture topic
1.   13.09.2013  Introductory lecture. Course requirements and literature sources. The concept of bioinformatics. What is bioinformatics and why do biologists need it? Biology, statistics, information technology and programming as the basic elements of bioinformatics
2.   20.09.2013  Types and volume of biological information. Genome organisation. Modern methods of genome analysis
3.   27.09.2013  Genome evolution. Comparative genomics
4.   04.10.2013  Biological information databases. Information search and retrieval systems
5.   11.10.2013  Examples of the use of various biological information databases
6.   18.10.2013  Basic principles of nucleic acid and protein sequence similarity. Pairwise comparison of nucleic acid and protein sequences. Types of BLAST
7.   25.10.2013  Methods for multiple comparison of nucleic acids and proteins, their advantages and conditions of use. Software for multiple comparison of nucleic acid and protein sequences
8.   01.11.2013  Phylogenetics. Cluster and cladistic methods in the reconstruction of phylogenetic trees
9.   08.11.2013  Seminar and review of assignments on topics related to information search in databases and sequence homology search
10.  15.11.2013  Software for phylogenetic analysis of nucleic acid and protein sequences
11.  22.11.2013  No class
12.  29.11.2013  Spatial structure of macromolecules and its prediction. DNA topology. Protein structure prediction, modelling and applications in pharmacology
13.  13.09.2013  Genome expression analysis. Transcriptomics. DNA chips in genome polymorphism analysis. Genetics of gene expression. Proteomics and systems biology. Network structures as a natural component of biological systems
14.  13.12.2013  Seminar and review of assignments on topics related to phylogenetic analysis and protein secondary structure prediction. Perspectives of bioinformatics. Bioinformatics as ...

FYI – Bioinformatics course in UL Faculty of Biology (by Nils Rostoks)

• Biological information: its diversity and volume
• Biology, statistics, information technology and programming as the basic elements of bioinformatics
• Genome organisation and evolution
• Comparative genomics
• Biological information databases. Information search and retrieval systems
• Basic principles of nucleic acid and protein sequence similarity. Various comparison methods, their advantages and conditions of use
• Phylogenetics. Cluster and cladistic methods in the reconstruction of phylogenetic trees
• Genome expression analysis
• DNA chips in genome polymorphism analysis. Genetics of gene expression
• DNA topology, protein structure, methods for its prediction and applications in pharmacology
• Proteomics and systems biology. Network structures as a natural component of biological systems
• Perspectives of bioinformatics. Bioinformatics as a prerequisite for learning modern biology

NIH WORKING DEFINITION OF BIOINFORMATICS AND COMPUTATIONAL BIOLOGY July 17, 2000

• Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioural or health data, including those to acquire, store, organize, archive, analyse, or visualize such data.
• Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modelling and computational simulation techniques to the study of biological, behavioural, and social systems.

Human Genome Project

• Began in 1990 in the US
• The primary goal – to sequence the human DNA, which is 3 billion base pairs long
• A working draft of the genome was released in 2000
• Finished in 2003, with further analysis still being published

The results of HGP

• A sequence 3 billion letters long, written in four letters (A, T, G and C) and containing all the human inherited information
• Genomes of many other organisms
• Development of biotechnology, allowing not only to sequence DNA, but also to study the function of different biomolecules, and producing many TB of data
• Databases storing this information (GenBank and EMBL data library)
• Data analysis and management needs leading to the emergence and development of bioinformatics

Things, however, have recently changed again – with NGS technologies the sequencing of a specific individual has become affordable – with direct implications on the amount of data that needs to be stored and/or analyzed.

All you need to know about Molecular Biology

One of the first textbooks in bioinformatics

MIT press 2000

Few other textbooks for «Computer Scientists»

MIT press 2004 Chapman and Hall/CRC 1995

Some bioinformatics problems from the perspective of Computer Science

Genome sequencing and assembly

E. Green (2001). Strategies for the systematic sequencing of complex genomes. Nat Rev Genetics, Vol 2:8, 573-583.

Ensembl genome browser

Genome sequence assembly

Sequence assembly problem

Ok, let us assume that we have these hybridizations.

How can we reconstruct the initial DNA sequence from them?

[Figure: Affymetrix GeneChip]

W. Bains, C. Smith (1988). A novel method for nucleic acid sequence determination. Journal of Theoretical Biology, Vol. 135:3, 303-307.

SBH – Hamiltonian path approach

Hamiltonian path (cycle) problem

For a given graph find a path (cycle) that visits every vertex exactly once (or show that such a path does not exist).

NP-hard. This means that no algorithm is known that is guaranteed to work in realistic time, even for comparatively small graphs.
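The Hamiltonian path view of SBH can be sketched as follows. This is only an illustration under stated assumptions: the hybridization result is taken to be a set of distinct l-mers (the "spectrum"), each occurring exactly once in the target sequence, and the function names `overlaps` and `sbh_hamiltonian` are made up for this example. The search is plain backtracking, so its worst-case running time is exponential, in line with the NP-hardness noted above.

```python
def overlaps(a, b):
    """True if the last l-1 letters of a equal the first l-1 letters of b."""
    return a[1:] == b[:-1]


def sbh_hamiltonian(spectrum):
    """Return one sequence whose l-mers are exactly `spectrum`, or None.

    Vertices are l-mers; there is an edge a -> b when a and b overlap by
    l-1 letters. A Hamiltonian path through this graph spells a candidate
    sequence. Found here by backtracking (exponential in the worst case).
    """
    kmers = sorted(set(spectrum))

    def extend(path, used):
        if len(path) == len(kmers):
            return path
        for k in kmers:
            if k not in used and overlaps(path[-1], k):
                result = extend(path + [k], used | {k})
                if result is not None:
                    return result
        return None

    for start in kmers:
        path = extend([start], {start})
        if path is not None:
            # First l-mer in full, then the last letter of each later l-mer.
            return path[0] + "".join(k[-1] for k in path[1:])
    return None


print(sbh_hamiltonian(["ATG", "TGG", "GGC", "GCG", "CGT"]))  # ATGGCGT
```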

SBH – Eulerian path approach

Eulerian path (cycle) problem

For a given graph find a path (cycle) that visits every edge exactly once (or show that such a path does not exist).

An Eulerian cycle exists if and only if the graph is connected and each of its vertices has even degree. Moreover, there is a simple linear time algorithm for finding an Eulerian cycle.
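A minimal sketch of the Eulerian approach to SBH, under assumptions that are not in the slides: each l-mer is treated as a directed edge from its (l-1)-letter prefix to its (l-1)-letter suffix, and the path is found with Hierholzer's algorithm. For directed graphs the degree condition becomes "in-degree equals out-degree at every vertex", except possibly at the start and the end of a path. The function name `sbh_eulerian` is illustrative.

```python
from collections import defaultdict


def sbh_eulerian(spectrum):
    """Reconstruct a sequence from its l-mers via an Eulerian path.

    Each l-mer is a directed edge prefix -> suffix between (l-1)-mers.
    Returns None when no single path uses every edge exactly once.
    """
    if not spectrum:
        return None
    graph = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for kmer in spectrum:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)
        balance[prefix] += 1
        balance[suffix] -= 1

    starts = [v for v, b in balance.items() if b == 1]
    ends = [v for v, b in balance.items() if b == -1]
    others = [v for v, b in balance.items() if b not in (-1, 0, 1)]
    if others or len(starts) > 1 or len(ends) > 1:
        return None
    # Start where there is one extra outgoing edge, else anywhere (a cycle).
    start = starts[0] if starts else spectrum[0][:-1]

    # Hierholzer's algorithm: walk until stuck, then back up and splice.
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if graph[v]:
            stack.append(graph[v].pop())
        else:
            path.append(stack.pop())
    path.reverse()

    if len(path) != len(spectrum) + 1:  # some edges were unreachable
        return None
    return path[0] + "".join(v[-1] for v in path[1:])


print(sbh_eulerian(["ATG", "TGG", "GGC", "GCG", "CGT"]))  # ATGGCGT
```

Every edge is pushed and popped once, so the running time is linear in the number of l-mers, in contrast to the backtracking needed for the Hamiltonian formulation.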

Next Generation Sequencing (Illumina)

In case of de-novo sequencing we have essentially the same fragment assembly problem as for SBH, only the number of DNA sequence fragments is much higher and their size is larger (~50-150 bp).

Sequence mappers

Sequence assembly – de Bruijn graphs

D. Zerbino, E. Birney (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, Vol. 18:5, 821-829.

All you need to know about Molecular Biology

Central dogma of molecular biology

DNA → (transcription) → RNA → (translation) → Protein

DNA

Four different nucleotides: adenine, guanine, cytosine and thymine. They are usually referred to as bases and denoted by their initial letters, A, C, G and T.

5' C-G-A-T-T-G-C-A-A-C-G-A-T-G-C 3'
   | | | | | | | | | | | | | | |
3' G-C-T-A-A-C-G-T-T-G-C-T-A-C-G 5'

DNA

DNA - Biology as an information science

5' C-G-A-T-T-G-C-A-A-C-G-A-T-G-C 3'
   | | | | | | | | | | | | | | |
3' G-C-T-A-A-C-G-T-T-G-C-T-A-C-G 5'

Thus, for many information related purposes, the molecule can be represented as

CGATTGCAACGATGC

The maximal amount of information that can be encoded in such a molecule is therefore 2 bits times the length of the sequence. Noting that the distance between nucleotide pairs in a DNA is about 0.34 nm, we can calculate that the linear information storage density in DNA is about 6x10^7 bits/cm, which is approximately 7 MB (about 1% of a CD-ROM) per cm.
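The density estimate can be rechecked from the two figures quoted on the slide (2 bits per base pair, 0.34 nm between neighbouring base pairs). The short calculation below is only a sanity check; the variable names are illustrative, and the 3 billion base pair length of the human genome is the figure used elsewhere in these slides.

```python
BITS_PER_BASE_PAIR = 2         # four letters -> log2(4) = 2 bits
BASE_PAIR_SPACING_NM = 0.34    # distance between neighbouring base pairs
NM_PER_CM = 1e7                # 1 cm = 10^7 nm

base_pairs_per_cm = NM_PER_CM / BASE_PAIR_SPACING_NM
bits_per_cm = BITS_PER_BASE_PAIR * base_pairs_per_cm
megabytes_per_cm = bits_per_cm / 8 / 1e6

# Same arithmetic for a whole genome of 3 billion base pairs.
human_genome_gb = 3_000_000_000 * BITS_PER_BASE_PAIR / 8 / 1e9

print(f"{base_pairs_per_cm:.2e} base pairs per cm")
print(f"{bits_per_cm:.2e} bits per cm")
print(f"{megabytes_per_cm:.1f} MB per cm")
print(f"{human_genome_gb:.2f} GB for the human genome")
```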

DNA replication – copying the information

Polymerase chain reaction – PCR – Xeroxing the DNA

Genome sequencing

• Reading the nucleotides in the DNA molecule and storing the readout in a computer
• Basic technology ideas
  – A version of PCR
  – Separation of molecules by chemical properties such as weight or length of the DNA
  – Molecule labelling and fluorescent labelling in particular
  – DNA fragmentation in random length bits

Anatomy of a chromosome

• Centromeres are the largest constriction of the chromosome
  – Site of attachment of spindle fibers
  – 100,000s of 171 base pair repeats, called alpha satellite sequences
  – Centromere associated proteins are bound

[Adapted from R.Yasbin]

Genomes, chromosomes

A genome is a set of DNA molecules, one (long) DNA molecule per chromosome.

[Figure: the 23 human chromosomes]

Organism   Number of chromosomes   Genome size in base pairs
Bacteria   1                       ~400,000 - ~10,000,000
Yeast      12                      14,000,000
Worm       6                       100,000,000
Fly        4                       300,000,000
Weed       5                       125,000,000
Human      23                      3,000,000,000

Genome sizes

Organism   Number of chromosomes   Genome size in base pairs   Year sequenced
Bacteria   1                       ~400,000 - ~10,000,000      1995
Yeast      12                      14,000,000                  1997
Worm       6                       100,000,000                 1999
Fly        4                       300,000,000                 2000
Weed       5                       125,000,000                 2001
Human      23                      3,000,000,000               2003

Information in the human genome – up to 0.75 GB (3 billion base pairs at 2 bits each)

www.ensembl.org

Genomes and genes

[Figure: structure of a gene on the DNA: control statement, TATA box, start, ribosome binding, 5' UTR, gene, termination (stop), 3' UTR, control statement. Transcription (RNA polymerase) produces mRNA; translation (ribosome) produces protein.]

Chromosomes - Eukaryotes

Chromosomes - Prokaryotes

(Eukaryotic) cell

[Adapted from Online Biology Book]

(Prokaryotic) cell

Viruses

Operons

[Adapted from R.Shamir]

Exons and introns

Gene regulation

[Adapted from http://www.gennetworks.co.uk/products/trrd.shtml]

RNA

• Like DNA, RNA consists of 4 nucleotides, but instead of thymine (T) it has an alternative, uracil (U)
• RNA is similar to DNA, but its chemical properties are such that it keeps itself single stranded
• RNA is complementary to a single stranded DNA

5' C-G-A-T-T-G-C-A-A-C-G-A-T-G-C 3' DNA
   | | | | | | | | | | | | | | |
3' G-C-U-A-A-C-G-U-U-G-C-U-A-C-G 5' RNA
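The base pairing in the DNA/RNA example can be written as a simple letter-by-letter mapping. This is a small sketch; the name `complementary_rna` is illustrative, and the result is written 3'->5' so that each letter lines up with its partner in the input strand.

```python
# Pairing rules between a DNA base and the RNA base that binds to it.
DNA_TO_RNA = {"A": "U", "C": "G", "G": "C", "T": "A"}


def complementary_rna(dna_5_to_3):
    """Return the RNA complementary to a DNA strand.

    The result is written 3'->5', so position i of the output pairs with
    position i of the input.
    """
    return "".join(DNA_TO_RNA[base] for base in dna_5_to_3)


dna = "CGATTGCAACGATGC"
rna_3_to_5 = complementary_rna(dna)
print(rna_3_to_5)        # GCUAACGUUGCUACG (read 3'->5')
print(rna_3_to_5[::-1])  # the same molecule written 5'->3'
```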

RNA structure

RNA sequence: ...AGGCUAUGGCCA...

Single-stranded, but A tends to pair with U, and G tends to pair with C

Splicing, translation, proteins

Ribosomes – read in RNA and use single amino acids to output proteins

Proteins

• Chains built from 20 different amino acids
• Have complex 3D shapes
• They are the main building blocks of a cell
  – 10% of the cellular mass is proteins, almost all the rest is water
  – Structural proteins
  – Functional proteins
    • Enzymes – biological catalysts

Protein structure and their function

Proteins are chains of 20 different types of amino acids, and they have complex structures determined by their sequence. The structures in turn determine their functions.

Discovering the genetic code

• Was done by feeding all 64 different nucleotide triplets to protein synthesis machinery (which was provided with all 20 amino acids)
• This allowed triplets to be mapped to amino acids
• It turned out that the code is the same in all organisms – the genetic code is universal!

Genetic code

Genetic code Completely worked out in 1962
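The mapping from the 64 triplets to the 20 amino acids can be stored as a lookup table. The sketch below uses the standard genetic code written as a 64-letter string, with codons ordered TTT, TTC, TTA, TTG, TCT, ... and '*' for stop codons; the names `CODON_TABLE` and `translate` are illustrative, and the input is assumed to be coding DNA (T instead of U) read in frame from its first letter.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code; one letter per codon in the order TTT, TTC, ..., GGG.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(coding_dna):
    """Translate coding DNA codon by codon; '*' marks a stop codon.

    Trailing letters that do not fill a whole codon are ignored.
    """
    usable = len(coding_dna) - len(coding_dna) % 3
    return "".join(
        CODON_TABLE[coding_dna[i:i + 3]] for i in range(0, usable, 3)
    )


print(len(CODON_TABLE))        # 64 triplets
print(translate("ATGGCCTGA"))  # MA*
```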

Why do we need to compare sequences?

• Genome is already sequenced (assume...)
• There are methods that predict DNA coding regions (genes)
• What are biological functions of these genes??
• We can find out what protein (sequence) a gene encodes
• But we still do not know what this protein does...
• However we can search for known proteins with similar sequences and such that functions of these proteins are known
• We want to find out something about proteins in humans
• The best approach is “experimental”, but tricky with humans...
• But we can try to use similar proteins (e.g. in mice) and start our experiments with them

Evolution of sequences

• Mutations are a natural process of DNA evolution
• DNA replication errors:
  – substitutions
  – insertions
  – deletions
  (insertions and deletions together are called indels)
• Similarity between sequences:
  – indicates their common ancestral origin
  – indicates similarity of biological functions
• Well, this is of course a simplification: the change of protein function will determine whether the organism will have offspring and the changed gene will survive
• Protein sequence similarity is closely associated with similarity of DNA coding regions

How to compare sequences?

Given two proteins:

>sp|P69905|HBA_HUMAN Hemoglobin
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPT
TKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDM
PNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLP
AEFTPAVHASLDKFLASVSTVLTSKYR
>tr|Q61287|Q61287_MOUSE Hemoglobin
MVLSGEDKSNIKAAWGKIGGHGAEYVAEALERMFASFP
TTKTYFPHFDVSHGSAQVKGHGKKVADALASAAGHLDD
LPGALSALSDLHAHKLRVDPVNFKLLSHCLLVTLASHH
PADFTPAVHASLDKFLASVSTVLTSKYR

How to decide how similar these are?

Edit distance as similarity measure

[Figure: dynamic programming table of edit distances between the prefixes of two example strings]
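A table of this kind can be filled in row by row with the standard dynamic programming recurrence for edit (Levenshtein) distance. The sketch below assumes unit cost for each substitution, insertion and deletion; the function name `edit_distance` and the example words are illustrative and are not the strings from the slide.

```python
def edit_distance(a, b):
    """Minimum number of single-letter substitutions, insertions and
    deletions (each of cost 1) needed to turn string a into string b."""
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string: i deletions
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # match or substitute
            ))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance("gamma", "gramma"))    # 1
```

Only two rows of the table are kept at a time, which is enough for the distance itself; recovering the actual alignment would require the full table.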

Substitution matrices

Similarity matrix

The most popular: PAM, BLOSUM, Gonnet

   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C  0 -3 -3 -3  8 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H -2  0  1 -1 -3  0  0 -2  7 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  6 -1 -1 -4 -3 -2
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 10  2 -3
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  6 -1
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4

"Traditional assumption": the substitution score is > 0 for substitutions that are more frequent than random ones, and < 0 for substitutions that are less frequent than random ones. [Adapted from M.Gerstein]

Global and local alignments

Gap penalties

[Adapted from M.Craven]

Global alignment with gap penalties

[Adapted from M.Craven]
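Global alignment with gap penalties can be sketched with the classic dynamic programming scheme (Needleman-Wunsch). The version below is a simplification under stated assumptions: a linear gap penalty (every gap position costs the same) and a plain match/mismatch score instead of a substitution matrix. The function name `global_alignment` and the default scores are illustrative, not values from the slides.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment
    of a and b with a linear gap penalty."""
    n, m = len(a), len(b)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # a[i-1] against a gap
                score[i][j - 1] + gap,            # b[j-1] against a gap
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = global_alignment("ACGT", "AGT")
print(score)
print(top)
print(bottom)
```

To score proteins, the match/mismatch test in `sub` could be replaced by a lookup in a substitution matrix; affine gap penalties (separate costs for opening and extending a gap) need additional tables and are not covered by this sketch.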

Protein sequence databases and comparison tools

>sp|P00156|CYB_HUMAN Cytochrome b - Homo sapiens (Human).
MTPMRKINPLMKLINHSFIDLPTPSNISAWWNFGSLLGACLILQITTGLFLAMHYSPDAS
TAFSSIAHITRDVNYGWIIRYLHANGASMFFICLFLHIGRGLYYGSFLYSETWNIGIILL
LATMATAFMGYVLPWGQMSFWGATVITNLLSAIPYIGTDLVQWIWGGYSVDSPTLTRFFT
FHFILPFIIAALATLHLLFLHETGSNNPLGITSHSDKITFHPYYTIKDALGLLLFLLSLM
TLTLFSPDLLGDPDNYTLANPLNTPPHIKPEWYFLFAYTILRSVPNKLGGVLALLLSILI
LAMIPILHMSKQQSMMFRPLSQSLYWLLAADLLILTWIGGQPVSYPFTIIGQVASVLYFT
TILILMPTISLIENKMLKWA

Assessment of the results - P-Values

• P(s > S) = .01 – a P-value of .01 occurs at the score threshold S (392 below) where the score s from a random comparison is greater than this threshold 1% of the time
• Likewise for P = .001 and so on.

[Adapted from M.Gerstein]

ROC (Receiver Operating Characteristic) curves

[Figure: ROC curves of two “methods” (red is more effective), with points at different score thresholds: Thresh=30, Thresh=20, Thresh=10. Both axes run to 100%.]

Coverage (roughly, the fraction of sequences that one confidently “says something” about): sensitivity = tp/n = tp/(tp+fn)

Error rate (the fraction of the “statements” that are false positives): specificity = tn/n = tn/(tn+fp), error rate = 1 - specificity = fp/n

[Adapted from M.Gerstein]
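The points of such a curve can be computed directly from the two formulas above. The sketch below is a hedged illustration: a comparison is called positive when its score is at least the threshold, the scores and true/false labels are invented for the example, and the function name `roc_points` is illustrative. It assumes there is at least one true and one false case, otherwise the ratios are undefined.

```python
def roc_points(scores, labels, thresholds):
    """For each threshold return (threshold, sensitivity, error_rate), where
    sensitivity = tp / (tp + fn) and error_rate = fp / (fp + tn)."""
    points = []
    pairs = list(zip(scores, labels))
    for t in thresholds:
        tp = sum(1 for s, y in pairs if s >= t and y)
        fn = sum(1 for s, y in pairs if s < t and y)
        fp = sum(1 for s, y in pairs if s >= t and not y)
        tn = sum(1 for s, y in pairs if s < t and not y)
        points.append((t, tp / (tp + fn), fp / (fp + tn)))
    return points


# Made-up scores; True marks a genuinely related pair of sequences.
scores = [35, 28, 22, 15, 12, 8]
labels = [True, True, False, True, False, False]
for t, sensitivity, error_rate in roc_points(scores, labels, [30, 20, 10]):
    print(t, round(sensitivity, 2), round(error_rate, 2))
```

Lowering the threshold moves along the curve: more true relationships are found, at the price of more false positives.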

MSA - Example

Multiple sequence alignment of 7 neuroglobins using ClustalX [Adapted from C.Struble]

MSA with dynamic programming approach

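Example for three sequences ending in V, S and A: the score of the best alignment of the prefixes NV, NS and NA is the maximum over the seven possible last columns.

\[
s(NV, NS, NA) = \max
\begin{cases}
s(N, N, N) + \sigma(V, S, A) \\
s(NV, N, N) + \sigma(-, S, A) \\
s(N, NS, N) + \sigma(V, -, A) \\
s(N, N, NA) + \sigma(V, S, -) \\
s(N, NS, NA) + \sigma(V, -, -) \\
s(NV, N, NA) + \sigma(-, S, -) \\
s(NV, NS, N) + \sigma(-, -, A)
\end{cases}
\]

Here \sigma denotes the score of a single alignment column and - denotes a gap.

[Adapted from G.Church]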

Protein sequence: ...VPEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANK...

[Adapted from R.Shamir]

Protein structures - representations

3D coordinates, "folds"

Protein folds - horseshoe ( - )

Problems related with protein structures

• Determination (not really a bioinformatics problem...)
• Prediction (protein folding problem; one of bioinformatics' Holy Grails...)
• Comparison (not trivial, but there are methods that work reasonably well in practice)
• Representations
• Surface modelling
• Modelling and prediction of protein interactions
• Visualisation

Haeckel’s “Tree of life”

[Figure: tree of life, from “lower” organisms to “higher” organisms]

A phylogenetic tree is a hierarchical, graphical representation of relationships. [Adapted from M.Thomas]

Phylogenetics

• Phylogenetic trees
• Hierarchic clustering and dendrograms
• Types of phylogenetic trees
• "Molecular clock"
• Methods for phylogeny construction
  – from distance matrices
  – from character matrices
• Other phylogeny related problems (comparison, merging etc)
• Programs for construction and visualisation of phylogenetic trees

[Adapted from E.Willasen]

Phylogenetic trees

Distance matrices and property matrices [Adapted from M.Thomas]

Using Phylogeny to Understand Gene Duplication and Loss

A. A gene tree.

B. The gene tree superimposed on a species tree, allowing identification of the duplication and loss events.

[Adapted from M.Thomas]

Gene regulation networks – outrageously simplified

Genes are regulated (switched on or off) by specific proteins called transcription factors.

[Figure: DNA with a promoter and a coding region for each of GENE 1 to GENE 4, and the corresponding network with nodes G1, G2, G3, G4]

Types of models for biomolecular networks

What are GRN models useful for?

1. Simulation. For given initial conditions compute how the system evolves with time.
2. Model checking. Does the model behave according to the specifications for which it was constructed?
3. Reconstruction from data. From a given set of data from biological measurements construct a model consistent with the data.
4. Constraints on parameters. Find the requirements on parameters that must be satisfied to ensure a given “stable behaviour”.
5. System dynamics. Find all the possible “stable behaviours” that the biological system might exhibit.
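Tasks 1 and 5 can be illustrated with a Boolean network, one of the model types mentioned for this course. The sketch below is a toy under stated assumptions: each gene is either on (1) or off (0), all genes are updated at the same time (synchronous update), and "stable behaviours" are taken to be the attractors, i.e. the fixed points and cycles of the state transition graph. The three-gene network and all names are hypothetical.

```python
from itertools import product


def step(state, rules):
    """Synchronous update: every gene is recomputed from the current state."""
    return tuple(rule(state) for rule in rules)


def simulate(state, rules, n_steps):
    """Task 1 (simulation): the trajectory from a given initial state."""
    trajectory = [state]
    for _ in range(n_steps):
        state = step(state, rules)
        trajectory.append(state)
    return trajectory


def attractors(rules):
    """Task 5 (system dynamics): all cycles, including fixed points, that
    are eventually reached from some initial state."""
    found = set()
    for start in product((0, 1), repeat=len(rules)):
        seen = []
        state = start
        while state not in seen:
            seen.append(state)
            state = step(state, rules)
        cycle = seen[seen.index(state):]
        # Rotate each cycle to begin at its smallest state, so the same
        # cycle found from different starting points is stored only once.
        k = cycle.index(min(cycle))
        found.add(tuple(cycle[k:] + cycle[:k]))
    return found


# Hypothetical network: G1 is repressed by G3, G2 follows G1, G3 follows G2.
RULES = (
    lambda s: int(not s[2]),
    lambda s: s[0],
    lambda s: s[1],
)

print(simulate((0, 0, 0), RULES, 3))
for cycle in sorted(attractors(RULES), key=len):
    print(len(cycle), cycle)
```

Enumerating all 2^n states is only feasible for small networks; it is used here because it makes the notion of an attractor explicit.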