Biological Computation (生物計算)


Chapter 6 Genomics and Gene Recognition

Guan-Shieng Huang (黃光璿), Department of Computer Science and Information Engineering, National Chi Nan University, 2004/04/26

Motivation

- Cells can determine the beginnings and ends of genes.
- How can we identify genes algorithmically?
  - in prokaryotic genomes
  - in eukaryotic genomes

Review

DNA Sequencing

- Determine the order of nucleotides in a DNA fragment.
- Maxam-Gilbert method, 1977
- Sanger's chain-termination method, 1977

Base-calling

- Phred program
  - Developed at the University of Washington in 1998; converts traces (analog signals) into sequences (digital signals).
  - Positions < 50: the trace is noisy.
  - Positions > 800: the signal declines.

High-throughput Sequencing

- Four-color fluorescent dyes have replaced the radioactive label.
- Reads longer than 800 bp are possible, though 500 to 700 bp is more common.
- Applied Biosystems' ABI Prism 3700 (TM)
  - six 96-well plates per day
  - 96 x 6 x 800 = 460,800 bases, i.e. about 0.5 Mb per day
- Amersham Pharmacia's MegaBACE 1000 (TM)

6.1 Prokaryotic Genomes

A prokaryotic genome should contain at least the information needed to
- make and replicate its DNA;
- make new proteins;
- obtain and store energy.

6.1.1 Contig Assembly

- TIGR (The Institute for Genomic Research) has turned bacterial genome sequencing into a cottage industry.
- Example: the bio-terrorism mailings (anthrax strains) of late 2001.

6.2 Prokaryotic Gene Structure

6.2.1 Promoter Elements

- promoter: a binding site in a DNA chain at which RNA polymerase binds to initiate transcription of messenger RNA from one or more nearby structural genes

6.2.1.1 RNA polymerases

Subunits of the prokaryotic RNA polymerase:
- β': binds to the DNA template
- β: links one nucleotide to another
- α: holds all subunits together
- σ: recognizes the specific promoter nucleotide sequences (this subunit is less conserved)

6.2.1.2

6.2.1.3

- consensus sequence
  - recognized by the same σ-factor
  - agrees across many different genes
- operon: a set of genes with related functions
- regulatory proteins
  - positive regulator
  - negative regulator
- vocabulary: attenuate (weaken), enhance (strengthen), repress (inhibit)

- lactose operon (in E. coli)
  - beta-galactosidase (z)
  - lactose permease (y)
  - lactose transacetylase (a)
  - One long polycistronic RNA makes all three proteins.

6.2.1.4 E. coli's Lac Operon

- The lac promoter is recognized by σ70.
- The operon is expressed most efficiently only when the cell's environment is rich in lactose and poor in glucose.
  - lactose high -> lactose combines with the negative regulator pLacI -> genes expressed
  - glucose low -> positive regulator CRP becomes active -> gene expression enhanced

6.2.2 Open Reading Frames

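- stop codons: UAA, UAG, UGA
  - In random sequence, 3 of the 64 codons are stops, so a run of N codons contains no stop with probability $(1 - 3/64)^N$.
  - $(1 - 3/64)^N = 0.05 \Rightarrow N \approx 63$
- E. coli: average gene length = 316.8 codons; only 1.8% of genes are shorter than 60 codons.
- Open reading frame (ORF): a run of consecutive triplet codons containing no stop codon.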
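The threshold calculation and the ORF definition above can be sketched in Python. This is only a minimal illustration: the function names, the restriction to the three forward frames, and the use of DNA letters (T instead of U) are assumptions made here, not part of the original slides.

```python
import math

# DNA equivalents of the stop codons UAA, UAG, UGA
STOP_CODONS = {"TAA", "TAG", "TGA"}

def min_significant_codons(p_chance=0.05):
    """Smallest N for which (1 - 3/64)**N <= p_chance."""
    return math.ceil(math.log(p_chance) / math.log(1 - 3 / 64))

def open_reading_frames(dna, min_codons):
    """Yield (frame, start, end) for stop-free codon runs of >= min_codons.

    Only the three forward frames are scanned. start and end are 0-based,
    end-exclusive offsets into dna; the terminating stop codon is excluded.
    """
    dna = dna.upper()
    for frame in range(3):
        run_start = frame
        for pos in range(frame, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                if (pos - run_start) // 3 >= min_codons:
                    yield frame, run_start, pos
                run_start = pos + 3
        # A run may also reach the last complete codon of the sequence.
        end = frame + ((len(dna) - frame) // 3) * 3
        if (end - run_start) // 3 >= min_codons:
            yield frame, run_start, end
```

A real gene finder would also scan the reverse complement and require a start codon; here `open_reading_frames(genome, min_significant_codons())` would simply list every forward stop-free run at least as long as the 5% chance threshold.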

- start codon: AUG
- E. coli: AUG ~ 83%; the alternative start codons (GUG and UUG) make up the remaining ~ 17%.
- How can the starting position for translation be determined?
  - start codon
  - Shine-Dalgarno sequence
    - an A,G-rich region that serves as the ribosome loading site
    - e.g., 5'-AGGAGGT-3'
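One way to combine the two signals above is to keep only start codons that have a Shine-Dalgarno-like motif a short distance upstream. The sketch below assumes an exact match to AGGAGG and a maximum spacing of 12 nt; both values, and the function name, are illustrative assumptions rather than figures from the slides.

```python
import re

def candidate_starts(dna, sd_motif="AGGAGG", max_gap=12, start_codons=("ATG",)):
    """Return offsets of start codons with sd_motif shortly upstream.

    The motif must lie entirely within max_gap + len(sd_motif) nucleotides
    before the start codon. Motif and max_gap are illustrative assumptions.
    """
    dna = dna.upper()
    hits = []
    # The lookahead lets overlapping start codons be reported as well.
    pattern = re.compile("(?=(" + "|".join(start_codons) + "))")
    for match in pattern.finditer(dna):
        pos = match.start()
        upstream = dna[max(0, pos - max_gap - len(sd_motif)):pos]
        if sd_motif in upstream:
            hits.append(pos)
    return hits
```

Real ribosome binding sites only approximate the consensus, so a practical scanner would score partial matches instead of requiring the exact motif.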

6.2.3 Conceptual Translation

6.2.4 Termination Sequences

(These signals terminate transcription, not translation.)
- More than 90% of prokaryotic operons contain intrinsic terminators:
  - an inverted repeat (7 to 20 bp, G-C rich), e.g., 5'-CGGATG|CATCCG-3'
  - about 6 U's following the inverted repeat
- The terminator causes RNA polymerase to pause for about 1 minute (RNA polymerase normally incorporates about 100 nt/sec).
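The terminator pattern described above (an inverted repeat followed by a run of U's) can be searched for directly. The following is a rough sketch on the DNA strand (T instead of U); the arm length, the allowed loop size, and the tail test (at least 6 T's among the next 8 nt) are assumptions chosen for illustration, and the G-C richness of the stem is not checked.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def intrinsic_terminators(dna, arm=6, max_loop=5, tail_len=8, min_ts=6):
    """Return (stem_start, stem_end) for inverted repeats with a T-rich tail.

    An inverted repeat is a left arm, an optional loop of up to max_loop nt,
    and a right arm equal to the reverse complement of the left arm.
    All numeric defaults are illustrative assumptions.
    """
    dna = dna.upper()
    hits = []
    for i in range(len(dna) - 2 * arm + 1):
        left = dna[i:i + arm]
        for loop in range(max_loop + 1):
            j = i + arm + loop
            right = dna[j:j + arm]
            if len(right) < arm:
                break  # ran off the end of the sequence
            if right == reverse_complement(left):
                tail = dna[j + arm:j + arm + tail_len]
                if tail.count("T") >= min_ts:
                    hits.append((i, j + arm))
                break  # stop at the first (tightest) hairpin for this arm
    return hits
```

With the example from the slide, `reverse_complement("CGGATG")` is `"CATCCG"`, which is exactly the right-hand half of 5'-CGGATG|CATCCG-3'.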

6.3 GC-Content in Prokaryotic Genomes

- The relative ratio of G/C to A/T
  - is recognized as a distinguishing attribute of bacterial genomes;
  - ranges widely, from 25% to 75% GC.
- The GC-content of each bacterial species seems to be independently shaped by mutational biases.

- GC-content is generally uniform throughout a bacterium's genome.
- Horizontal gene transfer: the movement of genetic material between bacteria other than by descent, in which information travels through the generations as the cell divides.
- GC-content therefore reflects the evolutionary history of the bacterium.
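Because GC-content is roughly uniform within one bacterial genome, windows whose GC-content departs strongly from the genome-wide value are natural candidates for horizontally transferred DNA. The sketch below illustrates that idea; the window size, step, and tolerance are arbitrary assumptions, not values from the slides.

```python
def gc_content(dna):
    """Fraction of G and C nucleotides in dna (0.0 for an empty string)."""
    dna = dna.upper()
    if not dna:
        return 0.0
    return (dna.count("G") + dna.count("C")) / len(dna)

def atypical_windows(genome, window=1000, step=500, tolerance=0.10):
    """Return (overall_gc, [(start, window_gc), ...]) for deviating windows.

    A window is flagged when its GC fraction differs from the genome-wide
    fraction by more than tolerance. All defaults are illustrative.
    """
    overall = gc_content(genome)
    flagged = []
    for start in range(0, len(genome) - window + 1, step):
        gc = gc_content(genome[start:start + window])
        if abs(gc - overall) > tolerance:
            flagged.append((start, round(gc, 3)))
    return overall, flagged
```

A flagged window is only a hint: short windows fluctuate by chance, so the tolerance has to be chosen with the window length in mind.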

6.4 Prokaryotic Gene Density

  85%~88% are associated with the coding regions

E. Coli

 4288 genes, average length 950 bp, separated by 118 bp.

 Finding genes in prokaryotic genomes is relatively easy.

 Long open reading frames (>60);  Matches to simple promoter sequences;   Transcriptional termination signal; Comparisons with the nucleotide sequences of known protein coding regions from other organisms.

6.5 Eukaryotic Genomes

Differences from prokaryotic genomes:
- Internal membrane-bound compartments allow eukaryotic cells to maintain a wide variety of chemical environments.
- Many eukaryotes are multicellular organisms, and each cell type usually has a distinctive pattern of gene expression.
- There is relatively little constraint on the size of eukaryotic genomes.
- Gene expression is more complicated and more flexible.

6.6 Eukaryotic Gene Structure

- Is it 1000 times harder than finding a needle in a haystack?
- Searching for long open reading frames is not appropriate, because introns interrupt the coding sequence.
- Grail EXP and GenScan
  - rely on neural networks and dynamic programming;
  - prediction accuracy < 50%.
- Features used for detection include
  - promoters;
  - a series of intron/exon boundaries;
  - putative ORFs with codon usage bias.

6.6.1 Promoter Elements

- prokaryotes: a single RNA polymerase
- eukaryotes: three kinds of RNA polymerases
- RNA polymerases I and III: their products are needed at fairly constant levels in all eukaryotic cells at all times.
- RNA polymerase II
  - basal promoter: where the RNA polymerase II initiation complex is assembled and transcription begins
  - upstream promoter elements: protein-binding sites
  - It has been estimated that at least 5 upstream promoter elements are required to identify a gene uniquely.
- RNA polymerase II does not recognize the basal promoter directly.
- basal transcription factors
  - TATA-binding protein (TBP)
  - at least 12 TBP-associated factors (TAFs)
- TATA-box for eukaryotes (at -25): 5'-TATAWAW-3' (W = A or T)
- initiator (Inr) sequence: 5'-YYCARR-3' (Y = C or T, R = A or G)
- Differences in transcription factors cause tissue-specific expression of some genes.
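The TATA-box and Inr consensus sequences above use degenerate letters (W, Y, R), which translate naturally into regular-expression character classes. The sketch below looks for a TATA-box followed by an Inr at a plausible distance; the 15 to 35 nt gap range and all function names are assumptions made for illustration.

```python
import re

# Only the degenerate letters that appear in the two consensus sequences.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "W": "[AT]", "Y": "[CT]", "R": "[AG]"}

def consensus_to_regex(consensus):
    """Compile a consensus string with degenerate letters into a regex."""
    return re.compile("".join(IUPAC[letter] for letter in consensus))

TATA_BOX = consensus_to_regex("TATAWAW")
INITIATOR = consensus_to_regex("YYCARR")

def basal_promoter_candidates(dna, min_gap=15, max_gap=35):
    """Return (tata_start, inr_start) pairs with an acceptable spacing.

    The gap is measured between the two start offsets; the allowed range
    is an illustrative assumption. Inr matches are non-overlapping.
    """
    dna = dna.upper()
    pairs = []
    for tata in TATA_BOX.finditer(dna):
        for inr in INITIATOR.finditer(dna, tata.end()):
            gap = inr.start() - tata.start()
            if min_gap <= gap <= max_gap:
                pairs.append((tata.start(), inr.start()))
    return pairs
```

Such short motifs occur by chance very often, which is one reason the slides note that several upstream promoter elements are needed to identify a gene.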

6.6.2 Regulatory Protein Binding Sites

- bacteria
  - RNA polymerase has a high affinity for promoters.
  - The emphasis is on negative regulation.
- eukaryotes
  - RNA polymerases II and III do not assemble around promoters very efficiently.
  - There is additional emphasis on positive regulation.

Transcription Factors

- constitutive: do not respond to external signals
- regulatory: do respond to external signals
- Transcription factors are sequence-specific DNA-binding proteins.

6.7 Open Reading Frames

- The nuclear membrane separates the processes of transcription and translation.
- DNA -> hnRNA (heterogeneous nuclear RNA) -> mRNA -> translation
- hnRNA is capped, spliced, and polyadenylated to become mRNA:
  - capping: chemical alteration (e.g., methylation)
  - splicing: removal of introns
  - polyadenylation: ~250 A's added at the 3' end
- Splicing causes a serious problem for gene recognition algorithms.
- Eukaryotic genes do not have to possess statistically significant long ORFs.

6.7.1 Introns and Exons

- GU-AG rule: an intron reads 5'-GU ... AG-3'.
- The splicing apparatus
  - scrutinizes sequences lying within introns;
  - does not constrain the information content of exons.
- introns: > 60 bp; ~450 bp on average in vertebrates
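The GU-AG rule and the minimum intron length above are easy to state in code, and doing so shows why the rule alone is a weak signal: every GU...AG pair of sufficient length is a candidate. The sketch below is illustrative only; the function names and the `max_len` cap are assumptions, and DNA input is converted to RNA letters.

```python
def follows_gu_ag(intron):
    """True if the sequence starts with GU and ends with AG (RNA or DNA)."""
    rna = intron.upper().replace("T", "U")
    return rna.startswith("GU") and rna.endswith("AG")

def candidate_introns(dna, min_len=60, max_len=2000):
    """Return all (start, end) spans that satisfy the GU-AG rule.

    start is the offset of the G in GU; end is one past the G in AG.
    min_len follows the > 60 bp figure; max_len is an arbitrary cap.
    """
    rna = dna.upper().replace("T", "U")
    donors = [i for i in range(len(rna) - 1) if rna[i:i + 2] == "GU"]
    acceptors = [i + 2 for i in range(len(rna) - 1) if rna[i:i + 2] == "AG"]
    return [(d, a) for d in donors for a in acceptors
            if min_len <= a - d <= max_len]
```

On real genomic sequence the number of candidates grows roughly with the product of GU and AG counts, so practical predictors add further evidence such as codon usage or EST matches.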

6.7.2 Alternative Splicing

- About 20% of human genes undergo alternative splicing.
  - extreme example: 64 different mRNAs from one gene
- The splicing apparatus consists of
  - small nuclear RNAs (snRNAs);
  - several proteins.

6.8 GC Content in Eukaryotic Genomes

- Eukaryotic ORFs are much harder to recognize.
- GC content helps to detect functional features of genomes (genes, promoters, codon choices).

6.8.1 CpG Islands

- Statistical evaluation of dinucleotide frequencies: GG, GA, GT, GC, ...
- CpG: 5'-CG-3' (p denotes the phosphodiester bond between C and G)
- CpG occurs at only about 20% of the frequency expected by chance.
- CpG islands: regions where CpG occurs at the normally expected level.

- In the human genome: ~45,000 CpG islands
  - associated with most housekeeping genes
  - associated with < 40% of tissue-specific genes
- Approximately 70% of the CpGs in mammalian genomes are methylated at cytosine.
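The "frequency expected by chance" above is usually turned into an observed/expected ratio: the number of CG dinucleotides divided by what the individual C and G counts would predict. The sketch below uses the common formula (#CG x N) / (#C x #G); the function name is an assumption, and no island-calling thresholds are implied.

```python
def cpg_observed_over_expected(dna):
    """Observed/expected CpG ratio: (#CG * N) / (#C * #G).

    Returns 0.0 when the sequence contains no C or no G.
    """
    dna = dna.upper()
    c_count, g_count = dna.count("C"), dna.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    # "CG" cannot overlap itself, so str.count gives the exact number.
    return dna.count("CG") * len(dna) / (c_count * g_count)
```

Under this measure, bulk mammalian DNA would score around 0.2 and a CpG island close to 1, matching the two levels described above.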

Methylation

- Methylation: addition of a methyl group (+CH3) to cytosine, written Ċ.
- Oxidative deamination converts Ċ to T, so 5'-ĊG-3' becomes 5'-TG-3'.

- DNA methylation is linked to acetylation (CH3CO, the acetyl group) of histones.
- Low DNA methylation together with high histone acetylation goes with a high level of gene expression.
- Methylation patterns of DNA are somewhat difficult to determine experimentally.

Histone

- euchromatin: less tightly packed, transcriptionally active
- heterochromatin: densely packed, transcriptionally inactive

6.8.2 Isochores

- Isochores are long regions of homogeneous base composition (with respect to GC-content).
  - > 1 Mb in length
  - The GC-content of an isochore is relatively uniform throughout (with a sliding window of 1000 bp, the difference is < 1%).
- The genome is a mosaic of isochores.

- The human genome has 5 classes of isochores, by GC-content:
  - L1 (39%), L2 (42%), H1 (46%), H2 (49%), H3 (54%)
  - H3 has at least 20 times the gene density of L1.
  - H3 makes up ~3% to 5% of the genome and contains ~80% of all housekeeping genes.
  - L1 and L2 together make up ~66% of the genome and contain ~85% of tissue-specific genes.

- GC-rich regions tend to
  - have a low level of methylation;
  - be stored as transcriptionally active euchromatin;
  - have promoter sequences closer to the transcriptional start site;
  - have shorter introns and genes;
  - have short interspersed nuclear elements (SINEs);
  - bias the use of codons.

6.8.3 Codon Usage Bias

- In the yeast genome, arginine is encoded by AGA 48% of the time, although five other codons (CGT, CGC, CGA, CGG, AGG) also specify arginine.
- Codon usage bias can be used to detect exons, since coding regions reflect this preference.
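A figure such as "AGA 48%" comes from counting synonymous codons in known coding sequences. The sketch below computes those fractions for the six arginine codons; it assumes the input is an in-frame coding sequence in DNA letters, and the function name is hypothetical.

```python
from collections import Counter

ARGININE_CODONS = ("AGA", "AGG", "CGT", "CGC", "CGA", "CGG")

def codon_fractions(coding_dna, codons=ARGININE_CODONS):
    """Return the share of each synonymous codon among those listed.

    coding_dna is read in frame from offset 0; a trailing partial codon
    is ignored. Fractions are 0.0 if none of the codons occur.
    """
    coding_dna = coding_dna.upper()
    counts = Counter(coding_dna[i:i + 3]
                     for i in range(0, len(coding_dna) - 2, 3))
    total = sum(counts[codon] for codon in codons)
    return {codon: (counts[codon] / total if total else 0.0)
            for codon in codons}
```

An exon detector could compare such fractions in a candidate ORF against the organism's known preferences; a frame whose usage looks random is less likely to be coding.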

6.9 Gene Expression

Eukaryotic gene recognition algorithms employ
- known promoter elements (e.g., TATA and CAAT boxes);
- CpG islands;
- splicing signals associated with introns;
- ORFs with characteristic codon utilization;
- similarity to ESTs or to genes from other organisms.

6.9.1 cDNAs and ESTs

- cDNAs (complementary DNAs)
- cDNAs can be cloned into vectors and maintained as a cDNA library.

Fig. 6.11

- ESTs (expressed sequence tags)
- cDNAs provide
  - the population of genes being expressed;
  - the relative abundance of each mRNA.
- reassociation kinetics: $R_0 t_{1/2}$

6.9.2 Serial Analysis of Gene Expression

SAGE (serial analysis of gene expression):
1. Make cDNAs from the sample.
2. Break the cDNAs into small fragments (10 to 14 nt) with restriction enzymes.
3. Randomly ligate the fragments into longer DNA molecules, which are cloned and sequenced.
4. Use a computer to recognize the original small fragments and compare them to known transcripts from the organism.

The tag counts reflect the relative abundance of the corresponding transcripts.
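The computational step of SAGE amounts to cutting the sequenced concatemers back into tags and counting them. The sketch below makes a strong simplification: tags are assumed to have a fixed length and to be joined with no separator, which ignores how real SAGE tags are delimited. The function names and the tag-to-transcript table are hypothetical.

```python
from collections import Counter

def count_tags(concatemers, tag_len=10):
    """Split each concatemer into consecutive fixed-length tags and count.

    Assumes tags are simply joined end to end (a simplification); a
    trailing fragment shorter than tag_len is ignored.
    """
    counts = Counter()
    for seq in concatemers:
        seq = seq.upper()
        for i in range(0, len(seq) - tag_len + 1, tag_len):
            counts[seq[i:i + tag_len]] += 1
    return counts

def transcript_abundance(tag_counts, tag_to_transcript):
    """Sum tag counts per transcript; unmatched tags go to 'unknown'."""
    abundance = Counter()
    for tag, n in tag_counts.items():
        abundance[tag_to_transcript.get(tag, "unknown")] += n
    return abundance
```

Dividing each transcript's count by the total number of tags would give the relative abundance mentioned above.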

6.9.3 Microarrays

- high-density oligonucleotide arrays (HDAs)

6.10 Transposition

- DNA transposons
- insertion sequences
- retrotransposons

6.11 Repetitive Elements

- tandemly repeated DNA
  - satellite DNA: 5 to 200 bp repeat unit, millions of copies
  - minisatellite: repeat unit < 25 bp
  - microsatellite: repeat unit < 4 bp

- interspersed repeats
  - LINE (long interspersed nuclear element)
  - SINE (short interspersed nuclear element)
- DNA fingerprinting

6.12 Eukaryotic Gene Density

Of the 3000 Mb human genome:
- < 90 Mb (3%) corresponds to coding sequences.
- 810 Mb (27%) is associated with introns, promoters, and pseudogenes.
- The remaining 2100 Mb (70%) consists of
  - 1680 Mb (56%) unique sequences;
  - 420 Mb (14%) repetitive DNA.

References and Figure Sources

- Fundamental Concepts of Bioinformatics, by Dan E. Krane and Michael L. Raymer, Benjamin/Cummings, 2003.
- Biochemistry, by J. M. Berg, J. L. Tymoczko, and L. Stryer, Fifth Edition, 2001.
- Biology, by Sylvia S. Mader, 8th edition, McGraw-Hill, 2003.
