Computational Biology
Lecture #1 & 2: Introduction

Bud Mishra
Professor of Computer Science and Mathematics
9 ¦ 17 ¦ 2002
11/7/2015
©Bud Mishra, 2001
L1-1
Syllabus
• Introductory Material
– What do we know?
– Biological information
– Biotechnology (e.g. arrays, PCR, hybridization; single molecules; mass spectrometry)
– Some biology (terminology)
• Mapping and Sequencing
• Functional Genomics
– Taking cells at different stages of development, what can we infer from gene expression level data? Can we determine the sequence of gene activation? (Tools that allow biologists to try to answer these questions.)
– Genetic Networks
– Clustering algorithms
• Population Genetics
– Diseases
– Linkage analysis
– Kinship analysis
• Comparative Genomics
– Phylogeny
– Gene rearrangements between species
– Gene families within species
Active Areas of Research(1)
• Human Genome Project: (Completed?)
– Read 3 billion base pairs in 46 human chromosomes
– Deemed “substantially completed on June 27,
2000.”
• Polymorphisms and Haplotyping
– SNPs (Single Nucleotide Polymorphisms): Catalog
the single base pair variations occurring about once in
every 800 base pairs of the human genome across the entire
population
– RFLP-Map: Restriction Fragment Length
Polymorphisms
Active Areas of Research(2)
• Transcription Maps:
– Identify all the genes (about 30,000?) in the
human genome.
– Particularly interesting are the ones involved in
cancer…About 100 oncogenes and 1000 tumor
suppressor genes
• Linkage Analysis:
– Relate genes (or polymorphic markers) to
phenotypes (externally observable traits) by
analyzing genomes of a family (kinship) or over a
population.
Active Areas of Research(3)
• Functional Genomics:
– Understand how an interactive network of
genes affects a chain of metabolic pathways to
ultimately determine the phenotypes
• Comparative Genomics:
– Relate genes within and across species to
understand their evolutionary
relationship…Phylogeny.
Active Areas of Research(4)
• Cell Informatics:
– Interaction between proteins (membrane and
soluble ones) to determine the dynamics of a
cell.
– Interaction among a heterogeneous
population of cells.
• Rational Drug Design:
– Design of drugs and delivery systems to
modify the dynamics of the cells.
Introduction to Biology
• Genome:
– Hereditary information of an organism is encoded in
its DNA and enclosed in a cell (unless it is a virus).
All the information contained in the DNA of a single
organism is its genome.
• DNA molecule can be thought of as a very long
sequence of nucleotides or bases:
Σ = {A, T, C, G}
Complementarity
• DNA is a double-stranded polymer and should be thought of as a
pair of sequences over the alphabet Σ = {A, T, C, G}. However, there is a relation of
complementarity between the two sequences:
– A ⇔ T, C ⇔ G
– That is, if there is an A (respectively, T, C, G) on one sequence at
a particular position then the other sequence must have a T
(respectively, A, G, C) at the same position.
• We will measure the sequence length (or the DNA length) in terms
of base pairs (bp): for instance, human (H. sapiens) DNA is 3.3 ×
10^9 bp, measuring about 6 ft of DNA polymer when completely stretched
out!
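The complementarity relation can be illustrated with a short Python sketch. The names `COMPLEMENT` and `complement` are illustrative choices, not part of the lecture; the function simply applies the A ⇔ T, C ⇔ G rule position by position.

```python
# Watson-Crick pairing rule from the slide: A <-> T, C <-> G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base that pairs with each position of `strand`, in the same order."""
    return "".join(COMPLEMENT[base] for base in strand)


print(complement("ATCGGA"))  # TAGCCT
```

Applying `complement` twice returns the original sequence, which is one way to see that the two strands carry the same information.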
Genome Size
Genomes vary widely in size:
• From a few thousand base pairs for
viruses to 2 ∼ 3 × 10^11 bp for
certain amphibians and flowering
plants.
• Coliphage MS2 (a virus) has the
smallest genome: only 3.5 ×
10^3 bp.
• Mycoplasmas (unicellular
organisms) have the smallest cellular
genome: 5 × 10^5 bp.
• C. elegans (nematode worm, a
primitive multicellular organism)
has a genome of size ∼ 10^8 bp.

Species            Haploid Genome Size (bp)   Chromosome Number
E. coli            4.64 × 10^6                1
S. cerevisiae      1.205 × 10^7               16
C. elegans         10^8                       11/12
D. melanogaster    1.7 × 10^8                 4
M. musculus        3 × 10^9                   20
H. sapiens         3 × 10^9                   23
A. cepa (onion)    1.5 × 10^10                8
Goal of a Genome Study
E.g. Human Genome Project
• Genetic Maps:
• Physical Maps: (For instance, the Human Genome Project [HGP] requires a
complete map of the human genome at a resolution of 100 Kb = 10^5 bp. That is, the
map would consist of “markers” spaced at most 10^5 bp apart.)
• DNA Sequencing:
• Gene Identification: Identify genes (parts of the DNA involved in controlling the
metabolic processes through proteins they encode) on physical maps or sequenced
DNA.
• Informatics: Elucidate the structure of the DNA as encoding of all the relevant
information.
– Diagnostic and Therapeutic Tools: Necessary for the treatment of genetic
diseases.
– Phylogenetic Tools: Used in understanding the process and mechanism of
evolution.
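As a rough worked example of the physical-map resolution, the sketch below computes the smallest number of markers needed to keep consecutive markers at most 10^5 bp apart. The function name `min_markers` is hypothetical, and the calculation treats the genome as one linear stretch, ignoring chromosome boundaries.

```python
def min_markers(genome_bp: int, max_spacing_bp: int) -> int:
    """Smallest number of evenly spaced markers with gaps of at most `max_spacing_bp`.

    Simplifying assumption: the genome is a single linear molecule.
    """
    # Integer ceiling division, avoiding floating-point rounding.
    return -(-genome_bp // max_spacing_bp)


# A 3 x 10^9 bp genome mapped at 100 Kb resolution:
print(min_markers(3_000_000_000, 100_000))  # 30000
```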
DNA ⇒ Structure and Components
• The usual configuration of DNA is a double helix
consisting of two chains or strands coiling around each other with
two alternating grooves of slightly different spacing. The
“backbone” in each strand is made of alternating big sugar
molecules (deoxyribose residues: C5H10O4) and small phosphate
((PO4)^3-) molecules.
• One of the four bases (the letters in our alphabet Σ), each
an almost planar nitrogenous organic compound, is connected to each
sugar molecule. The bases are:
– Adenine ⇒ A
– Thymine ⇒ T
– Cytosine ⇒ C
– Guanine ⇒ G
Genome in Detail
The Human Genome at
Four Levels of Detail.
Apart from reproductive
cells (gametes) and mature
red blood cells, every cell in
the human body contains
23 pairs of chromosomes,
each a packet of compressed
and entwined DNA (1, 2).
DNA ⇒ Structure and Components
(contd.)
• The sequence of bases defines the information encoded by the
DNA.
• Complementary base pairs (A-T and C-G) are connected by
hydrogen bonds, and each base pair forms a coplanar “rung”
connecting the two strands.
– Cytosine and thymine are smaller (lighter) molecules, called
pyrimidines.
– Guanine and adenine are bigger (bulkier) molecules, called purines.
– Adenine and thymine allow only for double hydrogen bonding, while
cytosine and guanine allow for triple hydrogen bonding.
• Thus the chemical (through hydrogen bonding) and the mechanical
(purine to pyrimidine) constraints on the pairing lead to the
complementarity and make the double-stranded DNA both chemically
inert and mechanically quite rigid and stable.
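The bonding counts above (two hydrogen bonds per A-T pair, three per C-G pair) lend themselves to a small calculation. The following is a minimal sketch; `hydrogen_bonds` and `gc_fraction` are illustrative names, and the input is one strand of a fully paired duplex.

```python
# Hydrogen bonds contributed by the base pair that each base takes part in.
H_BONDS = {"A": 2, "T": 2, "C": 3, "G": 3}


def hydrogen_bonds(strand: str) -> int:
    """Total hydrogen bonds holding `strand` to its complementary strand."""
    return sum(H_BONDS[base] for base in strand)


def gc_fraction(strand: str) -> float:
    """Fraction of positions that are G or C (the triple-bonded pairs)."""
    return sum(base in "GC" for base in strand) / len(strand)


print(hydrogen_bonds("ATGC"))  # 10
print(gc_fraction("ATGC"))     # 0.5
```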
DNA Structure.
The four nitrogenous bases of DNA
are arranged along the sugar-phosphate backbone in a particular
order (the DNA sequence),
encoding all genetic instructions for
an organism. Adenine (A) pairs
with thymine (T), while cytosine
(C) pairs with guanine (G). The two
DNA strands are held together by
weak bonds between the bases.
DNA ⇒ Structure and Components
(contd.)
• The building blocks of the DNA molecule are four kinds of
deoxyribonucleotides,
– where each deoxyribonucleotide is made up of a sugar residue, a
phosphate group and a base.
– From these building blocks (or the related dNTPs, deoxyribonucleoside
triphosphates) one can synthesize a strand of DNA.
• The sugar molecule in the strand is in the shape of a pentagon (4 carbons
and 1 oxygen) in a plane parallel to the helix axis and with the 5th carbon
(5' C) sticking out.
• The phosphodiester bond (-O-P-O-) between the sugars connects this 5' C
to a carbon in the pentagon (3' C) and provides a directionality to each
strand.
• The strands in a double-stranded DNA molecule are antiparallel.
• Most of the enzymes moving along the backbone move in the 5'-3'
direction.
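Because the two strands are antiparallel, the partner of a strand written 5'→3' is usually reported as its reverse complement, so that it too reads 5'→3'. A minimal Python sketch, with `reverse_complement` as an illustrative name:

```python
def reverse_complement(strand_5_to_3: str) -> str:
    """Return the opposite strand, also written in the 5'->3' direction."""
    pairs = str.maketrans("ATCG", "TAGC")    # A<->T, C<->G
    # Complement every base, then reverse to restore 5'->3' reading order.
    return strand_5_to_3.translate(pairs)[::-1]


print(reverse_complement("AAGCT"))  # AGCTT
```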
The Central Dogma
• The intermediate molecule carrying the information out of the nucleus of
a eukaryotic cell is RNA, a single-stranded polymer.
• RNA also controls the translation process, in which amino acids are assembled
into proteins.
• The central dogma (due to Francis Crick in 1958) states that these
information flows are all unidirectional:
“The central dogma states that once `information' has passed into protein
it cannot get out again. The transfer of information from nucleic acid
to nucleic acid, or from nucleic acid to protein, may be possible, but
transfer from protein to protein, or from protein to nucleic acid is
impossible. Information means here the precise determination of
sequence, either of bases in the nucleic acid or of amino acid residues in
the protein.”
RNA and Transcription
• The polymer RNA (ribonucleic acid) is similar to DNA but differs in several ways:
– it is single stranded;
– its nucleotide has a ribose sugar (instead of deoxyribose); and
– it has the pyrimidine base uracil, U, substituting for thymine, T; U is
complementary to A, like thymine.
• An RNA molecule tends to fold back on itself to make helical, twisted and rigid
segments.
– For instance, if a segment of an RNA is
5' - GGGGAAAACCCC - 3',
– then the C's fold back on the G's to make a hairpin structure (with a 4 bp stem
and a 4-base loop).
– The secondary RNA structure can be even more complicated; for instance, in
E. coli, Ala tRNA (transfer RNA) forms a cloverleaf shape.
– Prediction of RNA structure is an interesting computational problem.
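The hairpin example can be checked with a toy calculation. The sketch below is not a structure-prediction algorithm; it only counts how many bases at the 5' end pair with the 3' end, working inward. The name `hairpin_stem_length` and the `min_loop` parameter (the minimum number of unpaired bases left in the loop) are assumptions made for illustration.

```python
# Watson-Crick pairs in RNA: A-U and G-C.
RNA_PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}


def hairpin_stem_length(rna: str, min_loop: int = 3) -> int:
    """Count base pairs formed between the 5' end and the 3' end, working inward."""
    stem = 0
    i, j = 0, len(rna) - 1
    # Pair rna[i] with rna[j] only if at least `min_loop` bases remain between them.
    while j - i - 1 >= min_loop and RNA_PAIR[rna[i]] == rna[j]:
        stem += 1
        i += 1
        j -= 1
    return stem


segment = "GGGGAAAACCCC"
stem = hairpin_stem_length(segment)
print(stem, len(segment) - 2 * stem)  # 4 4  (stem length, loop length)
```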
RNA, Genes and Promoters
• A specific region of DNA that determines the synthesis of proteins (through
transcription and translation) is called a gene.
– Originally, a gene meant something more abstract: a unit of
heredity.
– Now a gene has been given a physical molecular existence.
• Transcription of a gene to a messenger RNA, mRNA, is keyed by an RNA
polymerase enzyme, which attaches to a core promoter (a specific sequence adjacent
to the gene).
• Regulatory sequences such as silencers and enhancers control the rate of
transcription
– by their influence on the RNA polymerase through a feedback control loop
involving many large families of activator and repressor proteins that bind to
DNA and
– which, in turn, act on the RNA polymerase through coactivator proteins and basal
factors.
Transcriptional Regulation of Gene
• The entire structure of transcriptional regulation of gene expression
is rather dispersed and fairly complicated:
– The enhancer and silencer sequences occur over a wide region spanning
many Kb from the core promoter in either direction;
– A gene may have many silencers and enhancers, and these can be shared
among genes;
– They are not unique: different genes may have different
combinations;
– The proteins involved in control of the RNA polymerase number
around 50; and
– Different cliques of transcription factors operate in different cells.
• Any disorder in their proper operation can lead to cancer, immune
disorder, heart disease, etc.
Transcription
• The transcription of DNA into mRNA is performed with a single
strand of DNA (the sense strand) around a gene.
• The double helix
– Untwists momentarily to create a transcriptional bubble, which moves
along the DNA in the 3' - 5' direction (of the sense strand),
– As the complementary mRNA synthesis progresses, adding one RNA
nucleotide at a time at the 3' end of the RNA and attaching a U
(respectively, A, G and C) for the corresponding DNA base A
(respectively, T, C and G),
– Ending when a termination signal (a special sequence) is encountered.
• The newly synthesized mRNA is capped by attaching special
nucleotide sequences to its 5' and 3' ends.
• This molecule is called a pre-mRNA.
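The base-by-base rule of transcription (U for A, A for T, G for C, C for G) can be sketched in a few lines of Python. The function name `transcribe` is illustrative; its argument is the DNA strand being read, written in the 3'→5' direction, so that the returned mRNA reads 5'→3'. Capping, termination signals and splicing are deliberately left out.

```python
# RNA base attached for each DNA base on the strand being read.
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(dna_3_to_5: str) -> str:
    """Build the mRNA (5'->3') for a DNA strand given in the 3'->5' direction."""
    return "".join(DNA_TO_RNA[base] for base in dna_3_to_5)


print(transcribe("TACGGT"))  # AUGCCA
```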
Gene Expression
•When genes are expressed, the genetic
information (base sequence) on DNA is
first transcribed (copied) to a molecule of
messenger RNA, mRNA.
•The mRNAs leave the cell nucleus and
enter the cytoplasm, where triplets of bases
(codons) forming the genetic code specify
the particular amino acids that make up an
individual protein.
•This process, called translation, is
accomplished by ribosomes (cellular
components composed of proteins and
another class of RNA) that read the genetic
code from the mRNA, and transfer RNAs
(tRNAs) that transport amino acids to the
ribosomes for attachment to the growing
protein.
Exons and Introns
• In eukaryotic cells, the region of DNA transcribed into a pre-mRNA involves more than just the information needed to
synthesize the proteins.
• The DNA segments containing the code for protein are the exons, which are
interrupted by the introns, the non-coding regions.
• Thus pre-mRNA contains both exons and introns and is altered to
excise all the intronic subsequences in preparation for the
translation process---this is done by the spliceosome.
• The location of splice sites, separating the introns and exons, is
dictated by short sequences and simple rules such as
– “introns begin with the dinucleotide GT and end with the dinucleotide
AG” (the GT-AG rule).
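The GT-AG rule alone can be turned into a naive search for candidate introns. The sketch below works on the DNA alphabet (GT corresponds to GU in the pre-mRNA) and lists every span that starts with GT and ends with AG; real splice-site recognition uses much more context. The names `candidate_introns`, `splice` and the `min_len` parameter are hypothetical.

```python
def candidate_introns(seq: str, min_len: int = 4) -> list[tuple[int, int]]:
    """Return (start, end) spans of `seq` that begin with GT and end with AG."""
    spans = []
    for start in range(len(seq) - 1):
        if seq[start:start + 2] != "GT":
            continue
        # `end` is exclusive; the span must be at least `min_len` bases long.
        for end in range(start + min_len, len(seq) + 1):
            if seq[end - 2:end] == "AG":
                spans.append((start, end))
    return spans


def splice(seq: str, start: int, end: int) -> str:
    """Remove the span [start, end) and join the flanking exons."""
    return seq[:start] + seq[end:]


gene = "ATGGTAAAGCCC"
print(candidate_introns(gene))  # [(3, 9)]
print(splice(gene, 3, 9))       # ATGCCC
```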
Protein and Translation
• The translation process begins at a particular location of the mRNA
called the translation start sequence (usually AUG) and is mediated
by the transfer RNA (tRNA), made up of a group of small RNA
molecules, each with specificity for a particular amino acid.
• The tRNA's carry the amino acids to the ribosomes, the site of
protein synthesis, where they are attached to a growing polypeptide.
• The translation stops when one of the three trinucleotides UAA,
UAG or UGA is encountered.
• Each 3 consecutive (nonoverlapping) bases of mRNA
(corresponding to a codon) code for a specific amino acid.
• There are 4^3 = 64 possible trinucleotide codons belonging to the set
{U, A, G, C}^3
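The count 4^3 = 64 can be confirmed by enumerating the set {U, A, G, C}^3 directly, here with Python's standard `itertools.product`:

```python
from itertools import product

# All ordered triples over the RNA alphabet, joined into codon strings.
codons = ["".join(triple) for triple in product("UAGC", repeat=3)]

print(len(codons))   # 64
print(codons[:4])    # ['UUU', 'UUA', 'UUG', 'UUC']
```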
Genetic Codes
• The codon AUG is the start codon and the codons UAA, UAG and
UGA are the stop codons.
– That leaves 61 codons (AUG also codes for methionine) to code for 20 amino acids, with an expected
redundancy of about 3!
– Multiple codons (one to six) are used to code a single amino acid.
• The line of nucleotides between and including the start and stop
codons is called an open reading frame (ORF).
• All the information of interest to us resides in the ORFs.
• The mapping from the codons to amino acids (naturally
extended to a mapping from ORFs to polypeptides by a
homomorphism) is given by
F_P : {U, A, G, C}^3 → {A, R, D, N, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y, V}
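The ORF definition above (everything from a start codon through the first in-frame stop codon) suggests a simple scan. This is a minimal sketch on a single mRNA strand; `find_orfs` is an illustrative name, and nested or overlapping ORFs are all reported.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def find_orfs(mrna: str) -> list[str]:
    """Return every ORF: from an AUG through the first in-frame stop codon, inclusive."""
    orfs = []
    for start in range(len(mrna) - 2):
        if mrna[start:start + 3] != "AUG":
            continue
        # Step through the following codons in the reading frame set by this AUG.
        for pos in range(start + 3, len(mrna) - 2, 3):
            if mrna[pos:pos + 3] in STOP_CODONS:
                orfs.append(mrna[start:pos + 3])
                break
    return orfs


print(find_orfs("CCAUGGCUUAAGG"))  # ['AUGGCUUAA']
```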
Amino Acids with Codes
Code  Abbr.  Amino acid      Codons
A     Ala    alanine         GC(U+A+C+G)
C     Cys    cysteine        UG(U+C)
D     Asp    aspartic acid   GA(U+C)
E     Glu    glutamic acid   GA(G+A)
F     Phe    phenylalanine   UU(U+C)
G     Gly    glycine         GG(U+A+C+G)
H     His    histidine       CA(U+C)
I     Ile    isoleucine      AU(U+A+C)
K     Lys    lysine          AA(A+G)
L     Leu    leucine         (C+U)U(A+G) + CU(U+C)
M     Met    methionine      AUG
N     Asn    asparagine      AA(U+C)
P     Pro    proline         CC(U+A+C+G)
Q     Gln    glutamine       CA(A+G)
R     Arg    arginine        (A+C)G(A+G) + CG(U+C)
S     Ser    serine          (AG+UC)(U+C) + UC(A+G)
T     Thr    threonine       AC(U+A+C+G)
V     Val    valine          GU(U+A+C+G)
W     Trp    tryptophan      UGG
Y     Tyr    tyrosine        UA(U+C)
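The codon table can be encoded compactly and used to translate an ORF. In this sketch the 64-letter string `AMINO` lists the one-letter amino acid codes for the standard genetic code in U, C, A, G order of the first, second and third base, with `*` marking stop codons; `CODON_TABLE` and `translate` are illustrative names.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... ; '*' = stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(orf: str) -> str:
    """Translate an mRNA reading frame codon by codon, stopping at a stop codon."""
    protein = []
    for i in range(0, len(orf) - 2, 3):
        amino_acid = CODON_TABLE[orf[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGGCUUAA"))  # MA
```

Counting the entries of `CODON_TABLE` that are not `*` gives 61, and six of them map to leucine (L), matching the table above.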
The Cell
• A cell is a small coalition of a set of genes held together in a set
of chromosomes (and perhaps even unrelated
extrachromosomal elements).
• Cells also have a set of machinery made of proteins, enzymes,
lipids and organelles taking part in a dynamic process of
information processing.
– In eukaryotic cells the genetic material is enclosed in the cell nucleus,
separated from the other organelles in the cytoplasm by a membrane.
– In prokaryotic cells the genetic material is distributed
homogeneously, as these cells do not have a nucleus.
– Examples of prokaryotic cells are bacteria, with a considerably simpler
genome.
Organelles
• The organelles common to eukaryotic plant and
animal cells include
– Mitochondria in animal cells and chloroplasts in plant cells
(responsible for energy production);
– A Golgi apparatus (responsible for modifying, sorting and
packaging various macromolecules for distribution within
and outside the cell);
– Endoplasmic reticulum (responsible for synthesizing
protein); and
– Nucleus (responsible for holding the DNA as
chromosomes and replication and transcription).
Chromosomes
• The entire cell is contained in a sack made of plasma membrane. Plant
cells are further surrounded by a cellulose cell wall.
• The nucleus of a eukaryotic cell contains its genome in several
chromosomes, where each chromosome is simply a single molecule of DNA
together with some proteins (primarily histones).
• A chromosome can be a circular molecule or a linear one, in which case the
ends are capped with special sequences called telomeres.
• The proteins in the nucleus bind to the DNA and effect the compaction of
the very long DNA molecules.
• In somatic cells (as opposed to gametes: egg and sperm cells) of most
eukaryotic organisms, the chromosomes occur in homologous pairs, with
the only exceptions being the X and Y chromosomes, the sex chromosomes.
Chromosomes
•Karyotype.
•Microscopic examination
of chromosome size and
banding patterns identifies
24 different chromosomes
in a karyotype, which is
used for diagnosis of
genetic diseases.
•The extra copy of
chromosome 21 (trisomy)
in this karyotype implies
Down's syndrome.
Ploidy
• Gametes contain only unpaired chromosomes; the egg
cell contains only an X chromosome and the sperm cell
either an X or a Y chromosome. The male has X and Y
chromosomes; the female, 2 X's.
• Cells with single unpaired chromosomes are called
haploid; the cells with homologous pairs, diploid; the
cells with homologous triplet, quadruplet, etc.,
chromosomes are called polyploid---many plant cells
are polyploid.
The dynamics of cell:
• The cell cycle ⇒ the set of events that occur
within a cell between its birth by mitosis and its
division into daughter cells again by mitosis
– interphase period when DNA is synthesized and
– mitotic phase
• The cell division by mitosis (into 2 daughter cells) and
meiosis (into 4 gametes from germ-line cells);
• Working of the machinery within the cell---mainly the
ones involving replication of DNA, transcription of DNA
into RNA and translation of RNA into protein.
The Cell Cycle:
• In growing cells, the four phases proceed
successively, taking 10-20 hrs.
• Interphase: comprises the G1, S, and G2
phases. DNA is synthesized in S and other
cellular macromolecules are synthesized
throughout interphase, roughly doubling the
cell’s mass.
• During G2 the cell is prepared for the mitotic
(M) phase, when the genetic material is
evenly apportioned and the cell divides.
• Nondividing cells exit the normal cycle,
entering the quiescent G0 state.
[Figure: the cell cycle, with phases G1, S, G2 and M, and the G0 state]
Differentiation & Suicide
• Cellular dynamics controls how a cell changes (or
differentiates) to carry out specialized functions
– Structural or morphological changes (muscles, neural, skin..)
– Immune systems: Many cell types come together in organized
tissues designed to let the body distinguish self from non-self.
• Programmed Cell Death/Apoptosis:
– Condensation of the nucleus.
– Fragmentation of the DNA.
– Morphological changes followed by consumption by
macrophages.
Cell Talk
• Cell Surface Receptors
– Extracellular domain for binding ligands (e.g., growth factors, adhesion molecules, etc.)
– Transmembrane domain
– Intracellular cytoplasmic domain
• Receptor-driven cellular behaviors are extremely important
– E.g., Growth, Secretion, Contraction, Motility and Adhesion
[Figure: a ligand binding to a receptor spanning the lipid layer (extracellular, transmembrane and cytoplasmic domains); coupling with membrane-associated molecules, trafficking and signalling]
Receptors and Gene Regulation
• Ligands bind to receptors at the cell surface.
• Bound receptors activate various intracellular enzymes and initiate entire cascades of intracellular reactions
– Some of these reactions trigger short-term (of the order of milliseconds to minutes) responses.
– Some eventually trigger long-term responses, e.g., requiring protein synthesis and additional molecular interactions
[Figure: signal, cascade, gene regulation; short-term response and long-term response]
A Complex Picture
[Figure: surface binding events and intracellular trafficking events: binding, coupling, signaling, internalization, recycling, synthesis, degradation, and intracellular signaling]
A Complex Picture
• Trafficking
– Receptor population undergoes many
complex events of coupling with other cell
surface molecules
– Internalization (RME: receptor-mediated
endocytosis)
– Recycling
– Degradation
– Synthesis
Interrupted Genes:
• An open reading frame (containing a
gene) consists of
– INTRONS: Intervening sequences ⇒
Noncoding regions
– EXONS: Protein coding regions
• Introns are abundant in eukaryotes and
certain animal viruses.
Interrupted Genes:
[Figure: DNA with exons (Exon1, Exon2) interrupted by introns (Intron1, Intron2, Intron3); transcription yields RNA (the primary transcript), and splicing yields mRNA]
Interrupted Genes:
• Introns can occur between individual
codons or within a single codon
• hnRNA (heterogeneous nuclear RNA): mixture of primary transcripts
with varying numbers of introns spliced.
[Figure: hnRNA in the nucleus; mRNA in the cell]
Some Genes…
Gene Product          Organism      Exon Length   #Introns   Intron Length
Adenosine deaminase   Human         1500          11         30,000
Apolipoprotein B      Human         14,000        28         29,000
Erythropoietin        Human         582           4          1562
Thyroglobulin         Human         8500          ≥ 40       100,000
α-interferon          Human         600           0          0
Fibroin               Silk Worm     18,000        1          970
Phaseolin             French Bean   1263          5          515
Regulation of Gene Expressions
• Motifs (short DNA sequences) that regulate transcription
– Promoter
– Terminator
• Motifs that modulate transcription
– Repressor
– Activator
– Antiterminator
[Figure: a gene flanked by a promoter (10-35 bp; transcriptional initiation) and a terminator (transcriptional termination)]
Promoters
• pol I (RNA polymerase I)
– Transcribes ribosomal RNA genes
– 100 ∼ 1000 bp in front of the gene
• pol II (RNA polymerase II)
– Transcribes genes encoding polypeptides
– Complex and variable regulatory regions
• pol III (RNA polymerase III)
– Transcribes transfer RNA and other small RNAs
– Both upstream and downstream
Motifs
• Each motif is a binding site for a specific protein
• Transcription Factor:
– Transcription factors (specific to a cell/environmental
conditions) bind to regulatory regions and facilitate
• Assembly of RNA polymerase into a transcriptional complex
• Activation of a transcriptional complex.
• Termination Factor:
– Assembly of proteins for termination and modification of the
end of the RNA
• Epigenetic Changes
– Methylation of the cytosine in the 5’ region
– Structural changes in chromatin
Organization of Genetic Information
• Bacterial Genome:
– Genes are closely spaced along the DNA.
– The sequences of genes may overlap.
– Related genes (encoding enzymes whose
functions are part of the same pathway or
whose activities are related) are linked as a
single transcription unit.
Organization of Genetic Information
• Eukaryotic Genome:
– Genes are separated by long stretches of
noncoding DNA sequences.
– Multiple genes in a single transcription unit are
extremely rare.
– Multiple chromosomes – Linear
– Chloroplasts and mitochondria – Circular
– Genes appearing on the same chromosome
are syntenic.
Location of Some Genes on
Human Chromosome.
Genes                           Chromosome
α-globin cluster                16
β-globin cluster                11
Immunoglobulin
  κ (light chain)               2
  λ (light chain)               22
  Heavy chain                   14
  Pseudogenes                   9, 32, 15, 18
Growth hormone gene cluster     17
Thymidine kinase                17
Insulin                         11
Galactokinase                   11
Viral oncogene homologues
  C-sis                         22
  C-mos                         8
  C-Ha-Ras-1                    11
  C-myb                         6
Interferons
  α & β cluster                 9
  γ                             12
Eukaryotic Genome
• Multiple copies of the same gene
– Solve “supply problem”
– There are several hundred ribosomal RNA genes in
mammals
• Pseudogenes
– Nonfunctional copies of genes…(Deletions or
alterations in the DNA sequence)
– Number of pseudo genes for a particular gene varies
greatly…Different from one organism to another.
Genes in Eukaryotes
• A gene may appear exactly once
• It may be part of a family of repeated sequences.
Members of a family may be clustered or dispersed.
• Members of a gene family may be related and functional
(expressed at different times in development, or in
different cells) or may be pseudo genes.
• Chromosomal Morphology:
– Nucleolar organizers (genes for ribosomal RNA)
– Telomeric and Centromeric regions (Tandemly repeated
sequences)
The Rearrangement of DNA Sequences
• Reshuffling of genes between homologous
chromosomes via reciprocal crossing-over
during both meiosis and mitosis.
• Gene synteny and linkages are usually preserved.
• Most rearrangements are random.
• Some rearrangements are normal processes
altering gene expressions in an orderly and
programmed manner.
Chromosomal Aberrations
• Breakage
• Translocation (Among non-homologous
chromosomes.)
• Formation of acentric and dicentric chromosomes.
• Gene Conversions
• Amplification and deletions
• Point mutations
• Jumping genes ⇒ Transposition of DNA segments
• Programmed rearrangements ⇒ E.g., antibody
responses.
Repeat Structure
• Copy Number: 2 ∼ 10^6
• Direct Repeats “head-to-tail”
– Tandem repeats or separated by other sequences
• Inverted Repeats “head-to-head”
– Stem-and-loop structure
– Hairpin structure
• Reverse Palindrome
• True Palindrome
Repeat Structure
• Tandem Direct Repeats
5’-AAGAG AAGAG AAGAG-3’
• Inverted Repeats
5’-GTCCAGN…NCTGGAC-3’
3’-CAGGTCN…NGACCTG-5’
Stem-and-loop structure associated with inverted repeats
[Figure: stem formed by the base pairs G-C, T-A, C-G, C-G, A-T, G-C]
• Reverse Palindrome
5’-GAATTC-3’
3’-CTTAAG-5’
• True Palindrome
5’-GTCAATGA AGTAACTG-3’
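The two kinds of palindrome can be told apart programmatically. In this sketch a reverse palindrome is a sequence equal to its own reverse complement (it reads the same 5'→3' on both strands), while a true palindrome reads the same forwards and backwards on a single strand. The function names are illustrative.

```python
def reverse_complement(seq: str) -> str:
    """Opposite strand of `seq`, written 5'->3'."""
    return seq.translate(str.maketrans("ATCG", "TAGC"))[::-1]


def is_reverse_palindrome(seq: str) -> bool:
    """True if both strands read identically in the 5'->3' direction."""
    return seq == reverse_complement(seq)


def is_true_palindrome(seq: str) -> bool:
    """True if a single strand reads the same in both directions."""
    return seq == seq[::-1]


print(is_reverse_palindrome("GAATTC"))         # True
print(is_true_palindrome("GTCAATGAAGTAACTG"))  # True
print(is_true_palindrome("GAATTC"))            # False
```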
Repeats within the Genome
• Gene Family
– A gene and its cognate pseudogenes
• Satellite: Repeats made of noncoding units
– Minisatellites: Tandem repeats…Mostly in
centromeric regions
– Satellite repeat units vary in length from 2
base pairs to several thousand.
Interspersed Repeats
• SINES: Short Interspersed Repeats
– Each repeat unit is of length 100 – 500 bps
– Processed pseudogenes derived from class III
genes
– Example: Alu repeats…dimeric head-to-tail
repeats of 130 bp
• LINES: Long Interspersed Repeats
– Each unit is of length > 6 Kb.
A Genome Grammar
• Consists of
– A stochastic grammar specifying target DNA
sequence together with
– A description of polymorphisms and
– A description of the sampling strategy for
experiments
• ⟨specification⟩ → ⟨DNA-Seg⟩
⟨Poly-Seg⟩*
⟨Sample-Seg⟩+
Stochastic Grammar
• ⟨DNA-Seg⟩ →
“.dna” ⟨DNA-Spec⟩
• ⟨Poly-Seg⟩ →
“.poly” ⟨Weight⟩+ ⟨Poly-Spec⟩
• ⟨Sample-Seg⟩ →
“.sample” ⟨Sample-Spec⟩
DNA Sequence
• .dna
A = 150                    ← sequence of length 150; Pr(A) = Pr(T) = Pr(C) = Pr(G) = ¼
B = A A m(.30)             ← A followed by a mutated copy of A; Pr(Mutation) = .30
C ∼ 3-7 p(.2, .3, .3)      ← a string of length 3 to 7; Pr(A) = .2, Pr(T) = .3, Pr(C) = .3, Pr(G) = .2; C = constant string
D = C m(0.03) n(10,30)     ← m = mutation rate, n = copy number
• S = 30,000,000
B m(.05, .10) p(.1,.1,.01) n(10)
D !(500)
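The first two lines of the `.dna` specification can be imitated with Python's standard `random` module. This is only a sketch of the idea, not the lecture's generator: `random_dna` and `mutate` are hypothetical helpers, and a "mutation" is assumed here to be a substitution by a different base (the specification may also cover insertions and deletions).

```python
import random


def random_dna(length, rng, probs=(0.25, 0.25, 0.25, 0.25)):
    """Draw `length` bases with Pr(A), Pr(T), Pr(C), Pr(G) given by `probs`."""
    return "".join(rng.choices("ATCG", weights=probs, k=length))


def mutate(seq, rate, rng):
    """Replace each base, independently with probability `rate`, by a different base."""
    out = []
    for base in seq:
        if rng.random() < rate:
            # Assumption: a mutation is a substitution to one of the other three bases.
            base = rng.choice([b for b in "ATCG" if b != base])
        out.append(base)
    return "".join(out)


rng = random.Random(0)            # fixed seed so the run is repeatable
A = random_dna(150, rng)          # A = 150
B = A + mutate(A, 0.30, rng)      # B = A A m(.30)
print(len(A), len(B))             # 150 300
```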
Polymorphisms
• Modify the ancestral sequence by a series of
– S: Point mutations (SNPs)
– D: Deletions
– X: Translocations
• Two haplotypes of weight .8 each and one haplotype of weight .4:
.poly .8 .8
S 0.00012T
D 1-1 .00012
D 2-2 .00006
D 3-3 .00002
D 500-1000 .00005
X 1000-2000 .0005
.poly .4
S .001
D 1-2 .0005
Sampling
• .sample
48,000                ← Number of Samples
400 600 .5            ← Read Lengths
.01 .02               ← Sequence Read Errors
.33 .33               ← Failure of Read
.3 1800 2200 .005     ← Clone size
• .sample
12,000
400 600 .5
.01 .03
.33 .33
.4 9000 11000 .015
Experiment
• The first sample generates 48,000 end reads from
inserts of average length 2 Kbp.
– Sample proportions: 40% from haplotype H1, 40%
from H2 and 20% from H3
• The second sample generates 12,000 end reads from
inserts of average length 10 Kbp.
– Sample proportions: 40% from haplotype H1, 40%
from H2 and 20% from H3
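The 40% / 40% / 20% proportions follow from normalizing the haplotype weights .8, .8 and .4 by their sum. A small sketch of that arithmetic, with hypothetical helper names and the assumption that reads are split in proportion to the weights:

```python
def haplotype_proportions(weights):
    """Normalize haplotype weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def reads_per_haplotype(n_reads, weights):
    """Expected number of reads from each haplotype, rounded to whole reads."""
    return [round(n_reads * p) for p in haplotype_proportions(weights)]


weights = [0.8, 0.8, 0.4]                    # H1, H2, H3
print(reads_per_haplotype(48_000, weights))  # [19200, 19200, 9600]
print(reads_per_haplotype(12_000, weights))  # [4800, 4800, 2400]
```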