Introduction to
Biological sequences
Sushmita Roy
www.biostat.wisc.edu/bmi576/
[email protected]
September 4, 2014
BMI/CS 576
Goals for today
• A few key concepts in molecular biology
– Nucleic acids
– Genes
– Proteins
– The Central Dogma
• Connection between DNA, RNA and proteins
• Problems in sequence similarity
– Sequence alignment
– Sequence search
A Living Cell
• The fundamental unit of life
• There are unicellular (one cell) and multi-cellular
organisms
• A cell has different cellular components
• We will be concerned with
– Nucleus
– Ribosomes
– Cytoplasm
• Prokaryotes (single-celled organisms lacking a nucleus)
• Eukaryotes (organisms whose cells have a nucleus)
An animal cell
http://www.genome.gov/Glossary/index.cfm?id=25
Deoxyribonucleic acid (DNA)
image from the DOE Human Genome Program
http://www.ornl.gov/hgmis
DNA is a double helical molecule
[Photos: Watson and Crick, Maurice Wilkins, Rosalind Franklin]
• In 1953, James Watson and Francis Crick discovered that the DNA molecule has two strands arranged in a double helix
• This was possible through the X-ray diffraction data from Maurice Wilkins and Rosalind Franklin
http://www.chemheritage.org/discover/online-resources/chemistry-in-history/themes/biomolecules/dna/watson-crick-wilkins-franklin.aspx
Nucleotides
• DNA is composed of small chemical units called nucleotides
• Nucleotide
– Nitrogen-containing base
– 5-carbon sugar: deoxyribose
– Phosphate group
– Phosphate-hydroxy (phosphodiester) bonds connect the nucleotides
• Four nucleotides make DNA
– adenine (A), cytosine (C), guanine (G) and thymine (T)
– The four nucleotides differ in the base
[Figure: a nucleotide, labeled Phosphate, Sugar, Hydroxy, Base]
Bases in the nucleotides
• Purines (Two rings)
Adenine (A)
Guanine (G)
• Pyrimidines (one ring)
Thymine (T)
Cytosine (C)
Nucleotides are linked to form one strand of DNA
[Figure: two linked nucleotides, each drawn with its phosphate group, its sugar (carbons 1’ to 5’) and its base]
5’ and 3’ of a DNA molecule
• Each strand is made up of linkages between the 5’ position (phosphate) of one nucleotide and the 3’ position of the adjacent nucleotide
• At one end, there is a free phosphate group: 5’ end
• At the other end, there is a free OH group: 3’ end
• Therefore we can talk about directionality
– the 5’ and the 3’ ends of a DNA strand
• The two strands are held together through base pairing
5’ and 3’ of a DNA molecule contd..
• A DNA sequence is read from 5’ to 3’
• The two strands run anti-parallel to each other
– One is the complement of the other
• For example, if AAG is the sequence on one strand, the sequence on the other strand, read 5’ to 3’, is CTT
– Not TTC
Watson-Crick Base pairing
A always bonds to T
C always bonds to G
• This base pairing is also called “complementary base pairing”
• Each strand has a base sequence that is complementary to the
sequence on the other strand.
• If you know the sequence on one strand, you know the sequence on
the other strand
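The pairing rules can be sketched in a few lines of Python. This is only an illustration under simple assumptions (uppercase DNA letters, sequences written 5’ to 3’); the names COMPLEMENT, complement and reverse_complement are made up for this sketch and are not from the slides.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(seq):
    """Partner bases, position for position (not reversed)."""
    return "".join(COMPLEMENT[base] for base in seq)


def reverse_complement(seq):
    """Sequence of the opposite strand, read in its own 5' to 3' direction."""
    return complement(seq)[::-1]


print(complement("AAG"))          # partner bases, position for position
print(reverse_complement("AAG"))  # the other strand, read 5' to 3'
```

Because the strands are anti-parallel, the partner bases have to be reversed before the other strand can be read 5’ to 3’.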
DNA stores the blueprint of an organism
• The heredity molecule
• Has the information needed to make an organism
• The double-strandedness of the DNA molecule provides stability and helps prevent errors in copying
– one strand alone has all the information
• DNA replication is the process by which this information is copied through generations of daughter cells
DNA replication
• Helicase, an enzyme, separates the double-helix
• DNA polymerase makes a copy of each strand using
free nucleotides
• Each strand of DNA serves as a template
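The template idea can be sketched in Python. This is a simplified model that ignores the enzymes and only tracks sequences; copy_from_template and replicate are hypothetical names, and all strands are assumed to be written 5’ to 3’.

```python
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def copy_from_template(template):
    """Build the new strand that pairs with `template`.

    Both the argument and the result are written 5' to 3', so the new
    strand is the reverse complement of its template.
    """
    return "".join(PAIR[base] for base in reversed(template))


def replicate(strand_a, strand_b):
    """Return two daughter duplexes, each with one old and one new strand."""
    new_b = copy_from_template(strand_a)  # pairs with template strand A
    new_a = copy_from_template(strand_b)  # pairs with template strand B
    return (strand_a, new_b), (new_a, strand_b)


strand_a = "CATTGCCCAGT"                 # strand A, 5' to 3'
strand_b = copy_from_template(strand_a)  # strand B, 5' to 3'
daughter_1, daughter_2 = replicate(strand_a, strand_b)
print(daughter_1)
print(daughter_2)
```

Both daughter duplexes carry the same pair of sequences as the parent, which is why either strand alone is enough to restore the full molecule.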
[Figure: the parent DNA double helix (strand A, 5’ CATTGCCCAGT 3’, paired with strand B, 3’ GTAACGGGTCA 5’) separates; template strand A is paired with a new strand B, and template strand B is paired with a new strand A. Adapted from “Understanding Bioinformatics”]
Videos on DNA replication
https://www.youtube.com/watch?v=zdDkiRw1PdU
https://www.youtube.com/watch?v=27TxKoFU2Nw
Chromosomes
• All the DNA of an organism is
divided up into individual
chromosomes
• Each chromosome is really a DNA
molecule
• Different organisms have different
numbers of chromosomes
Image from www.genome.gov
Different organisms have different numbers of chromosomes
Organism      # of chromosomes
Yeast         32
Human         46
Fly           8
Mouse         40
Arabidopsis   10
Worm          12
Genes
• Genes are the units of heredity
• A gene is a sequence of bases
which specifies a protein or
RNA molecule
• The human genome has ~25,000 protein-coding genes (still being revised)
• One gene can have many
functions
• One function can require many
genes
…GTATGTCTAAGCCTGAATTCAGTCTGCTTTAAACGGCTTC…
Genomes
• Refers to the complete complement of DNA for a given
species
• The human genome consists of 2 × 23 chromosomes
• Every cell (except egg and sperm cells and mature red
blood cells) contains the complete genome of an organism
Some Greatest Hits
Genome                        Where                          Year
H. influenzae (bacteria)      TIGR                           1995
E. coli (K12)                 Wisconsin                      1997
S. cerevisiae (yeast)         International collab.          1997
C. elegans (worm)             Washington U./Sanger           1998
D. melanogaster (fruit fly)   Multiple groups                2000
E. coli O157:H7 (pathogen)    Wisconsin                      2000
H. sapiens (humans)           International Collab./Celera   2001
M. musculus (mouse)           International Collab.          2002
R. norvegicus (rat)           International Collab.          2004
Some Genome Sizes
Genome            # base pairs
HIV               9750
E. coli           4.6 million
S. cerevisiae     12 million
C. elegans        97 million
D. melanogaster   137 million
H. sapiens        3.1 billion
The central dogma of Molecular biology
DNA → (Transcription) → RNA → (Translation) → Proteins
RNA: Ribonucleic acid
• RNA
– Made up of repeating nucleotides
– The sugar is ribose
– U is used in place of T
• A strand of RNA can be thought of as a string composed of
the four letters: A, C, G, U
• RNA is single stranded
– More flexible than DNA
– Can double back and form loops
– Such structures can be more stable
Transcription
• In eukaryotes: happens inside the nucleus
• RNA polymerase (RNA Pol) is an enzyme that builds an
RNA strand from a gene
• RNA Pol is recruited at specific parts of the genome in a
condition-specific way.
• Transcription factor proteins are assigned the job of RNA
Pol recruitment.
• RNA that is transcribed from a protein coding region is
called messenger RNA (mRNA)
Transcription
The RNA string produced is identical to the non-template strand except T is replaced by U.
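The relationship between the two DNA strands and the RNA can be sketched in Python. The example sequences and the function names are made up for illustration, and all strands are assumed to be written 5’ to 3’.

```python
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def transcribe_from_non_template(non_template):
    """RNA from the non-template strand: same letters, T becomes U."""
    return non_template.replace("T", "U")


def transcribe_from_template(template):
    """RNA built against the template strand (both written 5' to 3')."""
    non_template = "".join(PAIR[base] for base in reversed(template))
    return non_template.replace("T", "U")


non_template_strand = "ATGGTGCTG"  # hypothetical example, 5' to 3'
template_strand = "CAGCACCAT"      # its reverse complement, 5' to 3'
print(transcribe_from_non_template(non_template_strand))
print(transcribe_from_template(template_strand))
```

Both calls give the same RNA string, since pairing against the template strand reproduces the non-template sequence.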
Translation
• The process of turning mRNA into proteins
• Happens outside the nucleus, in the cytoplasm, at ribosomes
• Ribosomes are the machines that synthesize proteins from mRNA
Proteins
• Proteins are polymers too
• The repeating units are amino acids
• There are 20 standard amino acids
• DNA codes for protein
– How many nucleotides are needed to specify 20 amino acids?
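One way to reason about this question is to count how many distinct words of length k can be written with four bases, which is 4^k. The short Python sketch below does that count; min_codon_length is a hypothetical helper name.

```python
def min_codon_length(n_amino_acids, alphabet_size=4):
    """Smallest k such that alphabet_size ** k >= n_amino_acids."""
    k = 1
    while alphabet_size ** k < n_amino_acids:
        k += 1
    return k


for k in range(1, 4):
    print(k, 4 ** k)  # word length and number of distinct words
print(min_codon_length(20))
```

Words of length 1 or 2 give only 4 or 16 combinations, fewer than 20, while length 3 gives 64.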
Amino Acids
Amino acid      3-letter   1-letter
Alanine         Ala        A
Arginine        Arg        R
Aspartic Acid   Asp        D
Asparagine      Asn        N
Cysteine        Cys        C
Glutamic Acid   Glu        E
Glutamine       Gln        Q
Glycine         Gly        G
Histidine       His        H
Isoleucine      Ile        I
Leucine         Leu        L
Lysine          Lys        K
Methionine      Met        M
Phenylalanine   Phe        F
Proline         Pro        P
Serine          Ser        S
Threonine       Thr        T
Tryptophan      Trp        W
Tyrosine        Tyr        Y
Valine          Val        V
Codons
• Each triplet of bases is called a codon
• How many codons are possible?
• There are four special codons
– One start codon, AUG: start of translation (it also codes for methionine)
– Three stop codons (UAA, UAG, UGA): end of translation
• All other codons code for a particular amino acid
The Genetic Code: Specifies how mRNA
is translated into protein
Genetic code is degenerate
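Degeneracy can be checked by building the standard genetic code as a lookup table and counting codons per amino acid. The sketch below writes the table compactly, in the order UUU, UUC, UUA, UUG, UCU, and so on; the variable names are made up for this illustration.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# One-letter amino acids for the 64 codons of the standard genetic code,
# in UUU, UUC, UUA, UUG, UCU, ... order; "*" marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

codons_per_residue = Counter(CODON_TABLE.values())
print(len(CODON_TABLE))                  # total number of codons
print(codons_per_residue["*"])           # number of stop codons
print(codons_per_residue.most_common())  # many amino acids have several codons
```

Most amino acids are specified by more than one codon, which is what "degenerate" means here.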
Codons and Reading Frames
5’ CUCAGCGUUACCAU 3’
Frame 1: CUC AGC GUU ACC AU    Leu Ser Val Thr
Frame 2: C UCA GCG UUA CCA U   Ser Ala Leu Pro
Frame 3: CU CAG CGU UAC CAU    Gln Arg Tyr His
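The three reading frames of this mRNA can be reproduced with a short Python sketch. It assumes the standard genetic code, written compactly in UUU, UUC, UUA, UUG, ... order, and uses one-letter amino acid codes; translate_frame is a hypothetical helper name.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon; "*" marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_frame(mrna, frame):
    """Translate `mrna` (5' to 3') starting at offset `frame` (0, 1 or 2).

    Leftover bases that do not fill a codon are ignored. Stop codons are
    reported as "*" instead of ending translation, to keep the sketch short.
    """
    codons = [mrna[i:i + 3] for i in range(frame, len(mrna) - 2, 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)


mrna = "CUCAGCGUUACCAU"
for frame in range(3):
    print(frame, translate_frame(mrna, frame))
```

Shifting the start by one or two bases changes every codon, so the same mRNA gives three unrelated amino acid strings.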
Proteins are the workhorses of the cell
• Structural support
• Transport of substances
• Coordination of an organism’s activities
• Response of cell to chemical stimuli
• Protection against disease
• Catalyzing chemical reactions
Proteins are complex molecules
• Primary structure: the amino acid sequence
• Secondary structure
• Tertiary structure
• Quaternary structure
• These structures are formed through different levels of protein folding and packaging
Some well-known proteins
Hemoglobin: carries oxygen (http://en.wikipedia.org/wiki/Hemoglobin)
Insulin: metabolism of sugar (http://en.wikipedia.org/wiki/Insulin)
Actin: maintenance of cell structure (http://en.wikipedia.org/wiki/Actin)
Hemoglobin protein HBA1
DNA sequence (491 bp)
>gi|224589807:226679-227520 Homo sapiens chromosome 16, GRCh37.p9 Primary Assembly
  1 CCCACAGACT CAGAGAGAAC CCACCATGGT GCTGTCTCCT GACGACAAGA CCAACGTCAA
 61 GGCCGCCTGG GGTAAGGTCG GCGCGCACGC TGGCGAGTAT GGTGCGGAGG CCCTGGAGAG
121 GATGTTCCTG TCCTTCCCCA CCACCAAGAC CTACTTCCCG CACTTCGACC TGAGCCACGG
181 CTCTGCCCAG GTTAAGGGCC ACGGCAAGAA GGTGGCCGAC GCGCTGACCA ACGCCGTGGC
241 GCACGTGGAC GACATGCCCA ACGCGCTGTC CGCCCTGAGC GACCTGCACG CGCACAAGCT
301 TCGGGTGGAC CCGGTCAACT TCAAGCTCCT AAGCCACTGC CTGCTGGTGA CCCTGGCCGC
361 CCACCTCCCC GCCGAGTTCA CCCCTGCGGT GCACGCCTCC CTGGACAAGT TCCTGGCTTC
421 TGTGAGCACC GTGCTGACCT CCAAATACCG TTAAGCTGGA GCCTCGGTGG CCATGCTTCT
481 TGCCCCTTTG G
Amino acid sequence (142 aa)
>sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens GN=HBA1 PE=1 SV=2
MVLSPADKTNVKAAWGKVGAHAG
EYGAEALERMFLSFPTTKTYFPHFDL
SHGSAQVKGHGKKVADALTNAVAH
VDDMPNALSALSDLHAHKLRVDPV
NFKLLSHCLLVTLAAHLPAEFTPAVHA
SLDKFLASVSTVLTSKYR
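The amino acid record above is in FASTA format: a header line starting with ">" followed by sequence lines. A minimal reader can be sketched in Python as follows; read_fasta is a hypothetical helper written for this illustration, not a library function, and it assumes well-formed input.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]       # drop the leading ">"
            records[header] = []
        elif header is not None:
            records[header].append(line)
    # Join the wrapped sequence lines of each record into one string.
    return {name: "".join(parts) for name, parts in records.items()}


fasta_text = """\
>sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens GN=HBA1 PE=1 SV=2
MVLSPADKTNVKAAWGKVGAHAG
EYGAEALERMFLSFPTTKTYFPHFDL
SHGSAQVKGHGKKVADALTNAVAH
VDDMPNALSALSDLHAHKLRVDPV
NFKLLSHCLLVTLAAHLPAEFTPAVHA
SLDKFLASVSTVLTSKYR
"""

for name, seq in read_fasta(fasta_text).items():
    print(name.split()[0], len(seq))
```

Line breaks inside a record carry no meaning, so the reader joins them before measuring the length.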
RNA genes
• Not all genes encode proteins
• For some genes the end product is RNA
– ribosomal RNA (rRNA), which includes major
constituents of ribosomes
– transfer RNAs (tRNAs), which carry amino acids to
ribosomes
– micro RNAs (miRNAs), which play an important
regulatory role in various plants and animals
– lincRNAs (long intergenic non-coding RNAs), which play important regulatory roles
RECAP
• Key components of a eukaryotic cell
– Nucleus, Cytoplasm, Ribosome
• What are DNA and RNA?
– Large molecules called polymers
– Made up of repeated units called nucleotides
– DNA: ATGC
– RNA: AUGC
• What is a protein?
– Also a polymer, but the units are amino acids
• The Central Dogma: DNA->RNA->protein
• Important processes
– DNA replication, Transcription, Translation
• Some resources
– http://www.genome.gov/Glossary/index.cfm
http://www.youtube.com/watch?v=41_Ne5mS2ls
A video on transcription and translation
Things we did not talk about
• DNA packaging
• Alternative splicing
• Polyadenylation
• Post-translational modifications
A few important biological data/knowledge bases
• The 2014 Nucleic Acids Research Database Issue reports 1,552 databases
• National Center for Biotechnology Information (NCBI)
– http://www.ncbi.nlm.nih.gov
– GenBank: Database of sequences
– RefSeq: Reference sequences
• Ensembl
– http://useast.ensembl.org/info/about/index.html
• UniProt: Protein sequence and protein function
• Protein Data Bank: Protein structure
• Pathway databases
– Gene Ontology
– KEGG
• Interaction databases
– BioGRID
– STRING
See also http://nar.oxfordjournals.org/content/42/D1/D1.full#T1
Number of genomes in RefSeq
Source: http://www.ncbi.nlm.nih.gov/refseq/statistics/
Sequence similarity
• Sequence similarity is central to addressing many
questions in biology
– Are two sequences related?
• Similarity in sequence can imply similarity in function.
– Assign function to uncharacterized sequences based on
characterized sequences
• Sequences from different species can be compared to estimate the evolutionary relationships between species
– We will come back to this in Phylogenetic trees.
Overview of sequence similarity problems
• Assessing similarity between a small number of DNA
or protein sequences
– Pairwise sequence alignment
– Multiple sequence alignment
• Searching databases for a query sequence
– Heuristic search using BLAST
What is sequence alignment
The task of locating equivalent regions of two or
more sequences to assess their overall similarity
A very simple alignment of two sequences
THISSEQUENCE
THATSEQUENCE
[Figure: the aligned/matched positions are marked]
How to align these two sequences?
THISSEQUENCE
THATISASEQUENCE
The problem arises when the sequences to be compared are of unequal length
How do sequences change?
• Sequences change through mutations
– substitutions: ACGA → AGGA
– insertions: ACGA → ACGGA
– deletions: ACGA → AGA
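The three kinds of change can be written as small string operations. This Python sketch uses 0-based positions and hypothetical helper names (substitute, insert, delete) to reproduce the ACGA examples.

```python
def substitute(seq, pos, base):
    """Replace the base at index `pos` (0-based) with `base`."""
    return seq[:pos] + base + seq[pos + 1:]


def insert(seq, pos, bases):
    """Insert `bases` before index `pos`."""
    return seq[:pos] + bases + seq[pos:]


def delete(seq, pos, length=1):
    """Remove `length` bases starting at index `pos`."""
    return seq[:pos] + seq[pos + length:]


original = "ACGA"
print(substitute(original, 1, "G"))  # C at index 1 replaced by G
print(insert(original, 2, "G"))      # extra G inserted before index 2
print(delete(original, 1))           # C at index 1 removed
```

Substitutions keep the length unchanged, while insertions and deletions change it, which is why alignments need gaps.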
Need to incorporate gaps while aligning
sequences
Alignment 1: 3 gaps, 9 matches
___THISSEQUENCE
THATISASEQUENCE
Alignment 2: 3 gaps, 10 matches
THIS___SEQUENCE
THATISASEQUENCE
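Counting the columns of a gapped alignment can be sketched in Python. The function below assumes "_" is the gap symbol and that "match" means two identical letters in the same column; alignment_summary is a hypothetical name.

```python
def alignment_summary(row_a, row_b, gap="_"):
    """Count identical columns, mismatches and gap columns in an alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return {"matches": matches, "mismatches": mismatches, "gaps": gaps}


target = "THATISASEQUENCE"
print(alignment_summary("___THISSEQUENCE", target))  # gaps at the start
print(alignment_summary("THIS___SEQUENCE", target))  # gaps in the middle
```

The same number of gaps can give different match counts depending on where the gaps are placed.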
Issues in sequence alignment
• What type of alignment?
– Align the entire sequence or part of it?
– Two sequences or multiple sequences?
• How to find the alignment?
– Search algorithms for alignment
• How to score an alignment?
– The sequences we’re comparing typically differ in length
– Some characters (nucleotides or amino acids) are more substitutable than others
• How to tell if the alignment is biologically meaningful?
– Assessing how likely the alignment could have happened by
random chance
Algorithms for alignment
• Pairwise alignment algorithms based on Dynamic
programming
– Global alignment
– Local alignment
• Multiple sequence alignment
– Progressive/Guide-tree based approaches
– Iterative alignments
• BLAST
– Searching a query sequence in a database of sequences
with efficient pre-processing
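As a preview of the dynamic programming idea behind global alignment, here is a minimal Python sketch in the style of the Needleman-Wunsch algorithm. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, and the function returns only the best score, not the alignment itself.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of `a` and `b` by dynamic programming.

    score[i][j] holds the best score for aligning a[:i] with b[:j].
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap  # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # a[i-1] against a gap
                score[i][j - 1] + gap,       # b[j-1] against a gap
            )
    return score[-1][-1]


print(global_alignment_score("ACGA", "AGGA"))
print(global_alignment_score("ACGA", "ACGGA"))
print(global_alignment_score("THISSEQUENCE", "THATISASEQUENCE"))
```

Each table cell is filled from three neighbors, so the running time grows with the product of the two sequence lengths.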
Scoring alignments
• Percent identity
• Substitution matrices of amino acids
– Genuine matches may not be identical
– PAM, BLOSUM50 matrices
• Gap penalty functions
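These scoring ingredients can be combined in a small Python sketch. Everything here is an assumption for illustration: the toy substitution function stands in for a PAM or BLOSUM lookup, the gap penalty is affine with made-up costs, "_" is the gap symbol, and percent identity is taken over all alignment columns (other denominators are also in use).

```python
def gap_penalty(length, gap_open=2, gap_extend=1):
    """Affine gap penalty: opening a gap costs more than extending it."""
    return gap_open + (length - 1) * gap_extend


def toy_substitution(a, b):
    """Stand-in for a substitution matrix: +1 for identity, -1 otherwise."""
    return 1 if a == b else -1


def score_alignment(row_a, row_b, substitution, gap="_"):
    """Sum substitution scores over aligned pairs, minus gap penalties.

    Consecutive gap columns are treated as one gap run, whichever row
    they are in (a simplification).
    """
    total = 0
    run = 0  # length of the current run of gap columns
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            run += 1
            continue
        if run:
            total -= gap_penalty(run)
            run = 0
        total += substitution(a, b)
    if run:
        total -= gap_penalty(run)
    return total


def percent_identity(row_a, row_b):
    """Identical columns as a percentage of all alignment columns."""
    same = sum(a == b for a, b in zip(row_a, row_b))
    return 100.0 * same / len(row_a)


target = "THATISASEQUENCE"
for row in ("___THISSEQUENCE", "THIS___SEQUENCE"):
    score = score_alignment(row, target, toy_substitution)
    print(row, score, round(percent_identity(row, target), 1))
```

With a real substitution matrix, non-identical but similar amino acids would receive positive scores, which is how genuine matches that are not identical get credit.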
Reading assignment for Sep 9th
• Chapter 2, Sections 2.1-2.3, from Textbook: Biological
Sequence Analysis