Biological Computation


Chapter 6 Genomics and Gene Recognition

Department of Computer Science and Information Engineering, National Chi Nan University
黃光璿 (HUANG, Guan Shieng)
2004/04/26

Motivation

- Cells can determine where genes begin and end.
- How can we identify genes algorithmically?
  - in prokaryotic genomes
  - in eukaryotic genomes

Review


DNA Sequencing

- Determine the order of nucleotides in a DNA fragment.
- Maxam-Gilbert method (1977)
- Sanger's chain-termination method

Base-calling

- Phred program
  - Developed at the University of Washington in 1998; it converts traces (analog signals) into base sequences (digital signals).
- Trace positions < 50: noisy; positions > 800: signal declines.

High-throughput Sequencing

- Four-color fluorescent dyes have replaced the radioactive label.
- Reads longer than 800 bp are possible, though 500~700 bp is more common.
- Applied Biosystems' ABI Prism™ 3700
  - six 96-well plates per day
  - 96 × 6 × 800 ≈ 0.5 M bases per day
- Amersham Pharmacia's MegaBACE™ 1000
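The throughput estimate on this slide is plain arithmetic. A minimal sketch follows; the function name and its default values are illustrative, taken from the figures quoted above (96 wells, six plates per day, 800 bp reads).

```python
def daily_bases(wells_per_plate=96, plates_per_day=6, read_length_bp=800):
    """Upper-bound estimate of bases read per day by one sequencer."""
    return wells_per_plate * plates_per_day * read_length_bp


if __name__ == "__main__":
    # Best case with 800 bp reads, about 0.46 M bases (the slide rounds to 0.5 M).
    print(daily_bases())
    # A more typical 600 bp read length gives a lower figure.
    print(daily_bases(read_length_bp=600))
```

With the more common 500~700 bp reads, the daily yield is correspondingly lower than the 0.5 M figure.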


6.1 Prokaryotic Genomes

A prokaryotic genome should contain at least the information needed to
- make and replicate its DNA;
- make new proteins;
- obtain and store energy.

6.1.1 Contig Assembly


- TIGR (The Institute for Genomic Research)
  - has turned bacterial genome sequencing into a cottage industry.
- Example
  - bio-terrorism mailings (anthrax strains), late 2001.

6.2 Prokaryotic Gene Structure


6.2.1 Promoter Elements

- promoter
  - a site in a DNA chain at which RNA polymerase binds to initiate transcription of messenger RNA from one or more nearby structural genes

6.2.1.1 RNA polymerases

- β': binds to the DNA template
- β: links one nucleotide to the next
- α: holds all subunits together
- σ: recognizes the specific nucleotide sequences (σ is less conserved than the other subunits)

6.2.1.2


6.2.1.3

- consensus sequence
  - promoters recognized by the same σ-factor
  - agree for many different genes
- operon
  - a set of genes with related functions
- regulatory proteins
  - positive regulator: enhance
  - negative regulator: attenuate, repress

- lactose operon (in E. coli)
  - beta-galactosidase (z)
  - lactose permease (y)
  - lactose transacetylase (a)
- One long polycistronic RNA encodes all three proteins.

6.2.1.4 E. coli's Lac Operon

- Recognized by σ70.
- Most efficiently expressed only when the cell's environment is rich in lactose and also poor in glucose.
- lactose ↑ → lactose combines with the negative regulator pLacI → gene expressed
- glucose ↓ → positive regulator CRP → gene expression enhanced


6.2.2 Open Reading Frames

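- stop codons: UAA, UAG, UGA
  - In random sequence, N consecutive codons contain no stop codon with probability (1 - 3/64)^N.
  - (1 - 3/64)^N = 0.05 gives N ≈ 63.
- E. coli
  - average gene length = 316.8 codons; only 1.8% of genes are shorter than 60 codons
- Open Reading Frame (ORF)
  - a run of consecutive triplet codons containing no stop codon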
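The threshold N and the ORF definition above can be sketched as follows. This is a minimal illustration, not a gene finder: the function names are made up for this example, sequences are written as DNA (T instead of U), and only the forward strand is scanned.

```python
import math

STOP_CODONS = {"TAA", "TAG", "TGA"}  # DNA spelling of UAA, UAG, UGA


def min_significant_codons(alpha=0.05):
    """Smallest N such that (1 - 3/64)**N <= alpha."""
    return math.ceil(math.log(alpha) / math.log(1 - 3 / 64))


def stop_free_runs(dna, min_codons):
    """Yield (frame, start, end) for stop-free codon runs of at least
    min_codons codons (min_codons >= 1) in the three forward frames."""
    dna = dna.upper()
    for frame in range(3):
        run_start = frame
        for i in range(frame, len(dna) - 2, 3):
            if dna[i:i + 3] in STOP_CODONS:
                if (i - run_start) // 3 >= min_codons:
                    yield frame, run_start, i
                run_start = i + 3
        # Run that reaches the end of the sequence without meeting a stop.
        end = frame + ((len(dna) - frame) // 3) * 3
        if (end - run_start) // 3 >= min_codons:
            yield frame, run_start, end


if __name__ == "__main__":
    print(min_significant_codons())  # threshold for alpha = 0.05
    toy = "ATG" + "GCT" * 4 + "TAA" + "GG"
    print(list(stop_free_runs(toy, min_codons=3)))
```

A real scan would also examine the reverse complement and would require a start codon at the beginning of each run.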

- start codon: AUG
- E. coli
  - AUG ~ 83%; alternative start codons (GUG, UUG) ~ 17%
- How to determine the starting position for translation?
  - start codon
  - Shine-Dalgarno sequence
    - A,G-rich region that serves as the ribosome loading site
    - e.g., 5'-AGGAGGT-3'
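One way to combine the two signals above is to keep only those start codons that have a Shine-Dalgarno-like motif a short distance upstream. The sketch below is illustrative only: the function name, the core motif "AGGAGG" (taken from the example above) and the 20-nt upstream window are assumptions, and exact matching is far stricter than what ribosomes require.

```python
def starts_with_sd(dna, motif="AGGAGG", window=20):
    """Return positions of ATG codons that have `motif` somewhere in the
    `window` nucleotides immediately upstream (assumed parameters)."""
    dna = dna.upper()
    hits = []
    pos = dna.find("ATG")
    while pos != -1:
        upstream = dna[max(0, pos - window):pos]
        if motif in upstream:
            hits.append(pos)
        pos = dna.find("ATG", pos + 1)
    return hits


if __name__ == "__main__":
    seq = "CC" + "AGGAGGT" + "CCCCC" + "ATG" + "A" * 30 + "ATG" + "CC"
    # Only the first ATG has the motif within 20 nt upstream.
    print(starts_with_sd(seq))
```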


6.2.3 Conceptual Translation

6.2.4 Termination Sequences

(These signals terminate transcription.)
- More than 90% of prokaryotic operons contain intrinsic terminators:
  - an inverted repeat (7~20 bp, G-C rich), e.g., 5'-CGGATG|CATCCG-3'
  - ~ 6 U's following the inverted repeat
- They cause RNA polymerase to pause for ~ 1 min (RNA polymerase normally incorporates ~ 100 nt/sec).
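A very simplified search for the pattern described above (an inverted repeat followed by a run of U's, written here as T's in DNA) might look like this. The function names and parameters are hypothetical; the sketch assumes a fixed stem length and no loop between the two halves, as in the slide's example, whereas real terminators usually have a short loop and imperfect stems.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq[::-1].translate(COMPLEMENT)


def find_terminators(dna, stem=6, t_run=6):
    """Return start positions where a `stem`-nt sequence is immediately
    followed by its reverse complement and then by `t_run` T's."""
    dna = dna.upper()
    hits = []
    for i in range(len(dna) - 2 * stem - t_run + 1):
        left = dna[i:i + stem]
        right = dna[i + stem:i + 2 * stem]
        tail = dna[i + 2 * stem:i + 2 * stem + t_run]
        if right == revcomp(left) and tail == "T" * t_run:
            hits.append(i)
    return hits


if __name__ == "__main__":
    seq = "AAA" + "CGGATGCATCCG" + "TTTTTT" + "GG"
    print(find_terminators(seq))
```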


6.3 GC-Content in Prokaryotic Genomes

- G/C to A/T relative ratio
  - recognized as a distinguishing attribute of bacterial genomes
  - GC-content ranges widely, from 25% to 75%
- GC-content of each bacterial species
  - seems to be independently shaped by mutational biases
- GC-content is generally uniform throughout a bacterium's genome.
- horizontal gene transfer
  - the movement of genetic material between bacteria other than by descent, in which information travels through the generations as the cell divides
- GC-content reflects the evolutionary history of the bacterium.
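Because GC-content is generally uniform within one bacterial genome, windows whose GC-content departs strongly from the genome-wide value are candidates for horizontally transferred DNA. The following is a minimal sketch of that idea; the function names, window size, step and tolerance are arbitrary illustrative choices, not values from the lecture.

```python
def gc_content(seq):
    """Fraction of G and C in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(1 for base in seq if base in "GC") / len(seq)


def atypical_windows(genome, window=1000, step=500, tolerance=0.10):
    """Return (start, gc) for windows whose GC-content differs from the
    genome-wide GC-content by more than `tolerance`."""
    overall = gc_content(genome)
    flagged = []
    for start in range(0, len(genome) - window + 1, step):
        gc = gc_content(genome[start:start + window])
        if abs(gc - overall) > tolerance:
            flagged.append((start, gc))
    return flagged


if __name__ == "__main__":
    toy = "AT" * 100 + "GC" * 20 + "AT" * 100
    print(atypical_windows(toy, window=40, step=40, tolerance=0.2))
```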

6.4 Prokaryotic Gene Density

  85%~88% are associated with the coding regions

E. Coli

 4288 genes, average length 950 bp, separated by 118 bp.
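As a rough consistency check on these numbers, the sketch below assumes that every gene is followed by exactly one average-sized spacer and that nothing else is in the genome. Both function names are made up for this illustration.

```python
def coding_fraction(gene_len_bp=950, spacer_bp=118):
    """Fraction of sequence that is coding if genes and spacers alternate."""
    return gene_len_bp / (gene_len_bp + spacer_bp)


def genome_span_bp(n_genes=4288, gene_len_bp=950, spacer_bp=118):
    """Total length covered by n_genes gene-plus-spacer units."""
    return n_genes * (gene_len_bp + spacer_bp)


if __name__ == "__main__":
    print(f"coding fraction: {coding_fraction():.1%}")
    print(f"implied span: {genome_span_bp() / 1e6:.2f} Mb")
```

Under these assumptions the coding fraction comes out close to the upper end of the 85%~88% range quoted above.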


- Finding genes in prokaryotic genomes is relatively easy. Useful signals:
  - long open reading frames (> 60 codons);
  - matches to simple promoter sequences;
  - transcriptional termination signals;
  - comparisons with the nucleotide sequences of known protein-coding regions from other organisms.

6.5 Eukaryotic Genomes

- Differences from prokaryotic genomes
  - Internal membrane-bound compartments allow eukaryotes to maintain a wide variety of chemical environments.
  - In multicellular organisms, each cell type usually has a distinctive pattern of gene expression.
  - There is relatively little constraint on the size of their genomes.
  - Gene expression is more complicated and more flexible.

6.6 Eukaryotic Gene Structure

- 1000 times harder than finding a needle in a haystack?
- Searching for long open reading frames is not appropriate, since introns interrupt the coding sequence.

- GrailEXP & GenScan
  - rely on neural networks and dynamic programming;
  - prediction accuracy < 50%.

- Features used for detection include
  - promoters;
  - a series of intron/exon boundaries;
  - putative ORFs with codon usage bias.

6.6.1 Promoter Elements

- prokaryotes: a single RNA polymerase
- eukaryotes: three kinds of RNA polymerases
- RNA polymerases I and III
  - Their products are needed at fairly constant levels in all eukaryotic cells at all times.

- RNA polymerase II
  - basal promoter
    - where the RNA polymerase II initiation complex is assembled and transcription begins.
  - upstream promoter elements
    - protein-binding sites
    - It has been estimated that at least 5 upstream promoter elements are required to uniquely identify a gene.

- RNA polymerase II does not recognize the basal promoter directly.
- basal transcription factors
  - TATA-binding protein (TBP)
  - at least 12 TBP-associated factors (TAFs)
- TATA box for eukaryotes (at -25)
  - 5'-TATAWAW-3' (W = A or T)
- initiator (Inr) sequence
  - 5'-YYCARR-3' (Y = C or T, R = A or G)
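The degenerate symbols W, Y and R in the consensus sequences above translate directly into regular-expression character classes. A minimal sketch, using Python's standard `re` module; the helper names are illustrative and only the IUPAC symbols that appear on this slide are included.

```python
import re

# Only the symbols used in the two consensus sequences above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "Y": "[CT]", "R": "[AG]",
}


def iupac_to_regex(motif):
    """Compile a degenerate DNA motif into a regular expression."""
    return re.compile("".join(IUPAC[symbol] for symbol in motif))


TATA_BOX = iupac_to_regex("TATAWAW")
INR = iupac_to_regex("YYCARR")


def find_motif(pattern, dna):
    """Start positions of all matches, including overlapping ones."""
    lookahead = re.compile(f"(?=({pattern.pattern}))")
    return [m.start() for m in lookahead.finditer(dna.upper())]


if __name__ == "__main__":
    print(find_motif(TATA_BOX, "GGTATAAATGG"))
    print(find_motif(INR, "TTCCCAGGTT"))
```

Short consensus matches like these occur frequently by chance, so they are normally combined with other evidence.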


- Differences in transcription factors
  - cause tissue-specific expression of some genes.

6.6.2 Regulatory Protein Binding Sites

- bacteria
  - RNA polymerase has a high affinity for promoters.
  - emphasis on negative regulation
- eukaryotes
  - RNA polymerases II & III do not assemble around promoters very efficiently.
  - additional emphasis on positive regulation

Transcription Factors

- constitutive
  - do not respond to external signals.
- regulatory
  - do respond to external signals.
- sequence-specific DNA-binding proteins

6.7 Open Reading Frames

- Nuclear membrane
  - separates the processes of transcription and translation.
- DNA → hnRNA (heterogeneous nuclear RNA) → capped, spliced, polyadenylated → mRNA → translation
  - capping: chemical alteration (e.g., methylation) of the 5' end
  - splicing: removal of introns
  - polyadenylation: ~ 250 A's at the 3' end

- Splicing causes a serious problem for gene recognition algorithms.
  - Eukaryotic genes do not have to possess statistically significant long ORFs.

6.7.1 Introns and Exons


- GU-AG rule
  - 5'-GU ... AG-3': introns begin with GU and end with AG.
- Splicing apparatus
  - scrutinizes sequences lying within introns;
  - does not constrain the information content of exons.
- introns: > 60 bp; ~ 450 bp in vertebrates
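The GU-AG rule and the minimum intron length give a simple sanity check for proposed intron coordinates. The sketch below is an illustration under stated assumptions: the function name is made up, the input is an RNA string (U instead of T), and intron intervals are supplied by the caller as non-overlapping half-open (start, end) pairs.

```python
def splice(pre_mrna, introns, min_len=60):
    """Remove the given (start, end) introns after checking each one
    against the GU-AG rule and the minimum length."""
    pre_mrna = pre_mrna.upper()
    exons = []
    pos = 0
    for start, end in sorted(introns):
        intron = pre_mrna[start:end]
        if len(intron) < min_len:
            raise ValueError(f"intron {start}-{end} is shorter than {min_len} nt")
        if not (intron.startswith("GU") and intron.endswith("AG")):
            raise ValueError(f"intron {start}-{end} violates the GU-AG rule")
        exons.append(pre_mrna[pos:start])
        pos = end
    exons.append(pre_mrna[pos:])
    return "".join(exons)


if __name__ == "__main__":
    pre = "AUGGCC" + "GU" + "A" * 60 + "AG" + "UUUUAA"
    print(splice(pre, [(6, 70)]))
```

GU and AG dinucleotides are extremely common, so the rule can reject candidate introns but cannot identify them on its own.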

6.7.2 Alternative Splicing

- About 20% of human genes are alternatively spliced.
  - extreme example: 64 different mRNAs
- The splicing apparatus consists of
  - small nuclear RNAs (snRNAs);
  - several proteins.

6.8 GC Content in Eukaryotic Genomes

- Eukaryotic ORFs are much harder to recognize.
- GC content helps to detect functional features of genomes (genes, promoters, codon choices).

6.8.1 CpG Islands

- statistical evaluation of the frequencies of dinucleotides
  - GG, GA, GT, GC, ...
- CpG: 5'-CG-3'
  - p: phosphodiester bond
- CpG occurs at only about 20% of the frequency expected by chance.
- CpG islands: regions where CpG occurs at the normally expected level
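"Expected by chance" can be made concrete with an observed/expected ratio: if C and G were placed independently, a sequence of length N would contain about count(C) × count(G) / N CpG dinucleotides. The sketch below computes that ratio; the function name is illustrative, and real CpG-island finders add window-size and GC-content criteria that are not shown here.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: count(CG) * N / (count(C) * count(G))."""
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


if __name__ == "__main__":
    print(cpg_obs_exp("CCGG"))       # CpG at the level expected by chance
    print(cpg_obs_exp("CCCCGGGGAT"))  # CpG-depleted toy sequence
```

A ratio near 0.2 corresponds to the bulk genome described above, and a ratio near 1 to a CpG island.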


- In the human genome
  - ~ 45,000 CpG islands
  - associated with most housekeeping genes
  - associated with < 40% of tissue-specific genes
- Approximately 70% of the CpGs in mammalian genomes are methylated at cytosine.

Methylation

- methylation: addition of a methyl group (CH3)
- methylation + oxidative deamination: Ċ → T (Ċ = methylated cytosine)
  - 5'-ĊG-3' → 5'-TG-3'

- DNA methylation <-> acetylation (CH3CO, acetyl group) of histones
  - low methylation & high histone acetylation → high level of gene expression
- Methylation patterns of DNA are somewhat difficult to determine experimentally.

Histone


- euchromatin: less tightly packed, active
- heterochromatin: densely packed, inactive

6.8.2 Isochores

- long regions of homogeneous base composition (w.r.t. GC-content)
  - > 1 Mb in length
  - The GC-content of an isochore is relatively uniform throughout (with a sliding window of 1000 bp, the difference is < 1%).
- Isochores form a mosaic across the genome.

- human genome
  - 5 different classes of isochores, by GC-content: L1 (39%), L2 (42%), H1 (46%), H2 (49%), H3 (54%)
  - H3 has at least 20 times the gene density of L1.
  - |H3| ~ 3% to 5% of the genome, yet it contains ~ 80% of all housekeeping genes.
  - |L1| + |L2| ~ 66% of the genome; they contain ~ 85% of tissue-specific genes.

- GC-rich regions tend to
  - have a low level of methylation;
  - be stored as transcriptionally active euchromatin;
  - have promoter sequences closer to the transcriptional start site;
  - have shorter introns and genes;
  - have short interspersed nuclear elements (SINEs);
  - bias the use of codons.

6.8.3 Codon Usage Bias

- In the yeast genome
  - arginine: AGA is used 48% of the time (the other arginine codons are CGT, CGC, CGA, CGG, AGG).
- Codon usage bias can be used to detect exons, since exons reflect the organism's codon preference.
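A figure such as "AGA 48%" is the share of one codon among all codons for the same amino acid. The sketch below computes those shares for arginine in a coding sequence; the function name is illustrative and the sequence is assumed to be in frame from its first base.

```python
from collections import Counter

ARG_CODONS = ("AGA", "AGG", "CGT", "CGC", "CGA", "CGG")


def synonymous_usage(cds, codons=ARG_CODONS):
    """Relative usage of each codon in `codons` within an in-frame CDS."""
    cds = cds.upper()
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    total = sum(counts[codon] for codon in codons)
    return {codon: (counts[codon] / total if total else 0.0) for codon in codons}


if __name__ == "__main__":
    toy_cds = "ATGAGAAGACGTAGATAA"  # ATG AGA AGA CGT AGA TAA
    for codon, share in synonymous_usage(toy_cds).items():
        print(codon, f"{share:.2f}")
```

Comparing such shares in a candidate ORF against the organism's known preferences is one way to score whether the ORF looks like a real exon.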


6.9 Gene Expression

- Eukaryotic gene recognition algorithms employ
  - known promoter elements (e.g., TATA and CAAT boxes);
  - CpG islands;
  - splicing signals associated with introns;
  - ORFs with characteristic codon utilization;
  - similarity to ESTs or to genes from other organisms.

6.9.1 cDNAs and ESTs

- cDNA (complementary DNA)
- cDNAs can be cloned into vectors and maintained as a cDNA library.

Fig. 6.11


- ESTs (expressed sequence tags)
- cDNAs provide information about
  - the population of genes being expressed;
  - the relative abundance of each mRNA.
- reassociation kinetics: R0t1/2

6.9.2 Serial Analysis of Gene Expression

- SAGE
  - Make cDNAs from the sample.
  - Break the cDNAs into small fragments (10~14 nt) with restriction enzymes.
  - Randomly ligate the fragments into longer DNA molecules, which are cloned and sequenced.
  - Use a computer to recognize the original small fragments and compare them to known transcripts from the organism.
- Tag counts reflect the relative abundance of the corresponding transcripts.
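The computational step of SAGE, recovering tags from sequenced concatemers and counting them, can be sketched as follows. This is a simplified illustration: it assumes each tag directly follows a four-base anchoring site (CATG is used here as an assumed example), ignores ditag structure and sequencing errors, and all function names and the `known_tags` lookup table are hypothetical.

```python
from collections import Counter


def count_tags(concatemers, anchor="CATG", tag_len=10):
    """Count fixed-length tags that directly follow each anchor site."""
    counts = Counter()
    for read in concatemers:
        read = read.upper()
        pos = read.find(anchor)
        while pos != -1:
            tag_start = pos + len(anchor)
            tag = read[tag_start:tag_start + tag_len]
            if len(tag) == tag_len:  # skip truncated tags at the read end
                counts[tag] += 1
            pos = read.find(anchor, pos + 1)
    return counts


def abundance_by_transcript(tag_counts, known_tags):
    """Sum tag counts per transcript; unmatched tags go to 'unknown'."""
    result = Counter()
    for tag, n in tag_counts.items():
        result[known_tags.get(tag, "unknown")] += n
    return result


if __name__ == "__main__":
    read = "CATG" + "A" * 10 + "CATG" + "C" * 10 + "CATG" + "A" * 10
    tags = count_tags([read])
    print(abundance_by_transcript(tags, {"A" * 10: "geneA"}))
```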


6.9.3 Microarrays

- high-density oligonucleotide arrays (HDAs)

6.10 Transposition

- DNA transposons
- insertion sequences
- retrotransposons

6.11 Repetitive Elements

- tandemly repeated DNA
  - satellite DNA: 5~200 bp repeat unit, millions of copies
  - minisatellite: repeat unit < 25 bp
  - microsatellite: repeat unit < 4 bp
- interspersed repeats
  - LINE (long interspersed nuclear element)
  - SINE (short interspersed nuclear element)
- DNA fingerprint

6.12 Eukaryotic Gene Density

- The human genome is ~ 3000 Mb.
- < 90 Mb (3%) corresponds to coding sequences.
- 810 Mb (27%) is associated with introns, promoters, and pseudogenes.
- The remaining 2100 Mb (70%):
  - 1680 Mb (56%) unique sequences
  - 420 Mb (14%) repetitive DNA

References and Figure Sources

1. Fundamental Concepts of Bioinformatics, by Dan E. Krane and Michael L. Raymer, Benjamin/Cummings, 2003.
2. Biochemistry, by J. M. Berg, J. L. Tymoczko, and L. Stryer, Fifth Edition, 2001.
3. Biology, by Sylvia S. Mader, 8th edition, McGraw-Hill, 2003.