DEPARTMENT OF STATISTICS
Prof. Hélio Magalhães de Oliveira, UFPE, 21/08/2013
1/2 × n-ary = 1 × (semi-n-ary). A personal view
Thanks to Dr. Francisco Cysneiros
UNIVERSIDADE FEDERAL DE PERNAMBUCO
Statistical data on biological life: randomness as an indelible mark on the genome of species.
Prof. H. Magalhães de Oliveira
UFPE, Aug 2013
Chronological Scale of the Evolution of Life
DNA and the origin of life: a chronology (Battail, 2001)
WHAT IS LIFE, REALLY?
Current trends are tearing down the barriers between the living and the non-living.
• 1st change: the overcoming of vitalism.
• 2nd change: the sharp boundary between living and non-living things is disappearing.
Natural selection
– Darwinism and the theory of evolution
– DNA / RNA
Characteristic properties of natural life
• Ability to reproduce
• Sensitivity to the environment
• Metabolism
• Chemical uniqueness
• High degree of complexity and organization
• A genetic program that directs development
• A history shaped by natural selection
Difficulties in defining life.
SEEDS are alive, but do not metabolize.
VIRUSES do not reproduce by themselves (compare mules).
SAUSAGES are not alive, but they contain a genetic program and are made of proteins and DNA.
COMPUTER VIRUSES have properties of biological life: they reproduce, are sensitive to their environment, metabolize (they consume processing time and memory), can be complex, and survive through natural selection.
Fundamentals of DNA Structure
• Living organisms => cells
• Prokaryotes vs eukaryotes
• In eukaryotic cells, all activities are coordinated by the nucleus
• Nucleus: holds the DNA, which contains the genetic information.
– transmission of genetic information and
– protein synthesis.
DNA – Structure and Function
Nitrogenous bases
Purines
Pyrimidines
DNA – Structure
Phosphodiester bond
DNA – Structure
Complementary bases
1953: discovery of the structure of DNA
Watson & Crick: the double-helix structure of DNA
DNA – Structure and Function
Double helix
DNA – Replication
Replication requires DNA polymerase. The hydrogen bonds between the nitrogenous bases are broken (by the enzyme helicase) and the two DNA strands separate:
• Free nucleotides in the cell attach to the strands, always pairing with their complementary bases
• Two identical DNA molecules are formed.
• DNA replication is called semiconservative because each new DNA molecule has one new strand and one old strand, inherited from the parent molecule.
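The semiconservative scheme described in this slide can be sketched in a few lines. This is only an illustration: the names `COMPLEMENT`, `complement_strand` and `replicate` are hypothetical, and the two strands are stored position by position (strand orientation, 5' to 3', is ignored for simplicity).

```python
# Watson-Crick pairing for DNA (A-T, G-C).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_strand(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand)


def replicate(duplex: tuple[str, str]) -> list[tuple[str, str]]:
    """Semiconservative copy: each daughter keeps one parental strand
    and receives one newly synthesized complementary strand."""
    old_a, old_b = duplex
    return [
        (old_a, complement_strand(old_a)),  # old strand a + new strand
        (complement_strand(old_b), old_b),  # new strand + old strand b
    ]


# Illustrative ten-base duplex (assumption: any valid strand would do).
parent = ("GAGTTTTATC", complement_strand("GAGTTTTATC"))
daughters = replicate(parent)
```

Both daughter molecules come out identical to the parent, which is the property the slide states.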
The Central Dogma
DNA → DNA (replication)
DNA → RNA (retroviruses reverse this step: RNA → DNA)
RNA → protein (translation: protein synthesis)
Protein Synthesis - Translation
• Translation takes place in the ribosomes
• Base triplet of the mRNA: codon
• Base triplet of the tRNA: anticodon
Translation
Nirenberg & Khorana
Protein synthesis
Mapping DNA into Proteins
The genetic source is characterized by a four-letter alphabet:
$N = \{U, C, A, G\}$
Input alphabet: $N^3 = \{(n_1, n_2, n_3) \mid n_i \in N,\ i = 1, 2, 3\}$
Output alphabet: $A := \{$Leu, Pro, Arg, Gln, His, Ser, Phe, Trp, Tyr, Asn, Lys, Ile, Met, Thr, Asp, Glu, Gly, Ala, Val, Cys, Stop$\}$
Highly redundant map $GC: N^3 \to A$, with $\|N^3\| = 64$ and $\|A\| = 21$.
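A minimal sketch of the map GC from codons to amino acids, using one-letter amino acid symbols and `*` for Stop. The names `BASES`, `AMINO`, `GENETIC_CODE` and `translate` are illustrative choices, not part of the original slides; the 64-letter string lists the standard code with the bases ordered U, C, A, G.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated with the third letter varying fastest.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# GC: N^3 -> A, built by pairing each of the 64 codons with its symbol.
GENETIC_CODE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(mrna: str) -> str:
    """Translate an mRNA string codon by codon, stopping at a Stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = GENETIC_CODE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


input_size = len(GENETIC_CODE)                 # size of N^3
output_size = len(set(GENETIC_CODE.values()))  # size of A
```

The redundancy of the map shows up directly as 64 inputs against 21 outputs.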
The Genetic Code
Rows: 1st letter (left) and 3rd letter (right); columns: 2nd letter.

1st | U                  | C         | A             | G          | 3rd
U   | Phenylalanine      | Serine    | Tyrosine      | Cysteine   | U
U   | Phenylalanine      | Serine    | Tyrosine      | Cysteine   | C
U   | Leucine            | Serine    | Stop          | Stop       | A
U   | Leucine            | Serine    | Stop          | Tryptophan | G
C   | Leucine            | Proline   | Histidine     | Arginine   | U
C   | Leucine            | Proline   | Histidine     | Arginine   | C
C   | Leucine            | Proline   | Glutamine     | Arginine   | A
C   | Leucine            | Proline   | Glutamine     | Arginine   | G
A   | Isoleucine         | Threonine | Asparagine    | Serine     | U
A   | Isoleucine         | Threonine | Asparagine    | Serine     | C
A   | Isoleucine         | Threonine | Lysine        | Arginine   | A
A   | Methionine (start) | Threonine | Lysine        | Arginine   | G
G   | Valine             | Alanine   | Aspartic acid | Glycine    | U
G   | Valine             | Alanine   | Aspartic acid | Glycine    | C
G   | Valine             | Alanine   | Glutamic acid | Glycine    | A
G   | Valine             | Alanine   | Glutamic acid | Glycine    | G
• “Analogy would lead me one step further, namely, to the belief that all animals and plants have descended from some one prototype [...]
All living things have much in common, in their chemical composition, their germinal vesicles, their cellular structure, and their laws of growth and reproduction [...]
Probably all the organic beings which have ever lived on this earth have descended from some one primordial form, into which life was first breathed. ... From so simple a beginning endless forms most beautiful and most wonderful have been, and are being, evolved.”
CHARLES DARWIN (1859)
On the Origin of Species
DNA: Similarities
• Similarity between the DNA of two humans: 99 to 99.1%
• Similarity between humans and chimpanzees: 98.5%
• Only ~2% of the human genome codes for proteins:
• 3×10^9 bp × 2% = 6×10^7 bp; at 2 bits per base this is 120 Mb, and 120 Mb / (8 b/B) = 15 MB
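The storage estimate on this slide can be checked with a few lines of arithmetic. The sketch assumes, as the slide does, 2 bits per base (four-letter alphabet) and a coding fraction of 2%; the variable names are illustrative.

```python
genome_bp = 3e9          # human genome size in base pairs
coding_fraction = 0.02   # ~2% of the genome codes for proteins
bits_per_base = 2        # log2(4) for the alphabet {A, C, G, T}

coding_bits = genome_bp * coding_fraction * bits_per_base  # about 1.2e8 bits
coding_megabytes = coding_bits / 8 / 1e6                   # bits -> bytes -> MB
```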
Is man closer to the gorilla or to the orangutan?
Comparison of mitochondrial DNA
• human
• ATA ACC ATG CAC ACT ACT ATA ACC ACC CTA ACC CTG ACT
TCC CTA ATT CCC CCC ATC CTT ACC CTC GTT ACC ...
• gorilla
• ATA ACT ATG TAC GAT ACC ATA ACC ACC TTA GCC CTA ACT
TCC TTA ATT CCC CCT ATC CTT ACC TTC ATC ACT ...
• orangutan
• ACA GCC ATG TTT ACT ACC ATA ACT GCC CTC ACC TTA ACT
TCC CTA ATC CCC CCC ATT ACC GCT CTC ATT AAC ...
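One simple way to answer the question posed on this slide is to count the positions at which the aligned fragments differ (the Hamming distance). This is only a sketch on the three short fragments shown; the function name `hamming` is illustrative, and a real phylogenetic comparison would use longer sequences and a substitution model.

```python
def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    a, b = a.replace(" ", ""), b.replace(" ", "")
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


human = ("ATA ACC ATG CAC ACT ACT ATA ACC ACC CTA ACC CTG ACT "
         "TCC CTA ATT CCC CCC ATC CTT ACC CTC GTT ACC")
gorilla = ("ATA ACT ATG TAC GAT ACC ATA ACC ACC TTA GCC CTA ACT "
           "TCC TTA ATT CCC CCT ATC CTT ACC TTC ATC ACT")
orangutan = ("ACA GCC ATG TTT ACT ACC ATA ACT GCC CTC ACC TTA ACT "
             "TCC CTA ATC CCC CCC ATT ACC GCT CTC ATT AAC")

d_gorilla = hamming(human, gorilla)
d_orangutan = hamming(human, orangutan)
# The species with the smaller distance is the closer one on this fragment.
closest = "gorilla" if d_gorilla < d_orangutan else "orangutan"
```

On these fragments the human sequence differs from the gorilla sequence in fewer positions than from the orangutan sequence.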
1953: first amino acid sequence
Sanger: amino acid sequence of bovine insulin
MALWTRLRPLLALLALWPPPPARAFVNQHLCGSHLVEALYLVCGERGFFYTP
KARREVEGPQVGALELAGGPGAGGLEGPPQKRGIVEQCCASVCSLYQLENYCN
Alternative Representations of the Genetic Code
– Inner-to-outer map
– 2D-Gray genetic map
– Genetic world-chart representations
• DE OLIVEIRA, H.M., SANTOS-MAGALHÃES, N.S., The Genetic Code revisited: Inner-to-outer map, 2D-Gray map, and World-map Genetic Representations, 11th International Conference on Telecommunications, ICT2004, August 1-7, 2004, Fortaleza, Brazil, submitted.
• SANTOS-MAGALHÃES, N.S., BOUTON, E.A., DE OLIVEIRA, H.M., How to Represent the Genetic Code?, Reunião Anual da Sociedade Brasileira de Bioquímica, SBBq, 2004, submitted.
The Inner-to-outer Map
First nucleotide: inner circle
Second nucleotide: surrounding
Third nucleotide: outer region
Inner-to-outer map for the genetic code
Homophonemes
[Figure: 64-QAM modem constellation; each of the 64 points carries a 6-bit label, with axis levels -7, -5, -3, -1, 1, 3, 5, 7]
de Oliveira
U [11]; A  [00]; G  [10]; C  [01].
bacteriophage FX174: Each binary codeword belongs
to a constant weigh code.
DNA Codeword
G...C 01
10
A...T 00
11
G...C 01
10
T...A 11
00
T...A 11
00
T...A 11
00
T...A 11
00
A...T 00
11
T...A 11
00
G...C 01
10
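The constant-weight property follows from the labelling itself: complementary bases receive complementary 2-bit labels, so every base pair yields a 4-bit word with exactly two ones. A small sketch, where `BITS`, `PAIR`, `pair_codeword` and `weight` are illustrative names and T is given the label the slide assigns to U:

```python
# 2-bit labels from the slide; T (DNA) takes the label of U (RNA).
BITS = {"U": "11", "T": "11", "A": "00", "G": "10", "C": "01"}
# Watson-Crick partner of each DNA base.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def pair_codeword(base: str) -> str:
    """4-bit codeword of a base pair: label of the base, then of its partner."""
    return BITS[base] + BITS[PAIR[base]]


def weight(word: str) -> int:
    """Hamming weight: number of ones in the binary word."""
    return word.count("1")


weights = {base: weight(pair_codeword(base)) for base in "ACGT"}
```

Every entry of `weights` equals 2, whichever base sits on the reference strand.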
2D-Gray Representation
de Oliveira, Santos Magalhães 2004
Genetic Code:
Mapping of the amino acids
Santos Magalhães, E. Bouton, de Oliveira 2004
Coloured 2D-Gray genetic map
[Figure: coloured 2D-Gray genetic map; each cell of the grid is labelled with the amino acid, or Stop, encoded by its codon]
Coloured Genetic code map for amino-acids
This representation merges regions mapped into the same amino acid!
Nirenberg-Khorana's Earth: continents
Continents of Nirenberg-Khorana's Earth: regions of essential amino acids correspond to land, and nonessential amino acids constitute the ocean.
Exons and introns
http://www.dnalc.org/resources/3d/rna-splicing.html
Removing the introns from the transcript
Stretch of DNA from human β-hemoglobin
(reading frames)
• ...ACA GAC ACC ATG GTC CAC CTT GAC...
• . .. CAG ACA CCA TGG TGC ACC TGG...
•
... AGA CAC CAT GGT GCA CCT TGA ...
Genes of the β subunit of hemoglobin (2 genes)
[Figure: gene diagram with the two genes labelled B and A and segments of 90 bp, 131 bp, 222 bp, 851 bp and 126 bp]
Portion of the DNA of the HIV-1 genome
• GGG TTC TTG GGA GCA GCA GGA AGC ACT ATG GGC GCA ...
• Cancer is caused by agents (carcinogens, radiation, viruses) that damage DNA or interfere with its replication and/or repair mechanisms.
Genomic analysis
Spectrum for locating exons (gene F56F11.4)
Wavelet analysis of genomic sequences
c-myb oncogene (chicken): 8,200 bp
Human β-cardiac: 6,000 bp
Genome Music - Body Music
Susumu Ohno
URL: http://www.toshima.ne.jp/~edogiku/FlaMovIntro/
DNA of bacteriophage φX174
• 5,386 bp - 10 genes (A to K)
Gene | amino acids | length (bp) | reading frame
A    | 455         | 1,539       | 2
B    | 120         | 360         | 1
C    | 86          | 258         | 1
D    | 152         | 456         | 3
E    | 91          | 273         | 1
F    | 427         | 1,281       | 2
G    | 175         | 525         | 1
H    | 328         | 984         | 3
J    | 38          | 114         | 2
K    | 56          | 168         | 3
Total length of the genes: 5,958 bp
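The gene lengths listed for φX174 add up to more than the genome itself, which is only possible if some genes overlap (they are read in different frames). A quick check of that arithmetic, with illustrative names and the lengths taken from the slide:

```python
GENOME_BP = 5386  # size of the phiX174 genome in base pairs

# Gene lengths in base pairs, as listed on the slide.
GENE_BP = {
    "A": 1539, "B": 360, "C": 258, "D": 456, "E": 273,
    "F": 1281, "G": 525, "H": 984, "J": 114, "K": 168,
}

total_gene_bp = sum(GENE_BP.values())
# Lower bound on the number of base pairs shared by overlapping genes.
min_overlap_bp = total_gene_bp - GENOME_BP
```

The difference is only a lower bound on the overlap, since the genome also contains short non-coding stretches.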
Genes in the DNA of bacteriophage φX174
GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTTGATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAG
TGGACTGCTGGCGGAAAATGAGAAAATTCGACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTTGCGACCTTTCGCCATCAACTAACGATTCTGTCAAAAACTGACGCGTTG
GATGAGGAGAAGTGGCTTAATATGCTTGGCACGTTCGTCAAGGACTGGTTTAGATATGAGTCACATTTTGTTCATGGTAGAGATTCTCTTGTTGACATTTTAAAAGAGCGTGGA
TTACTATCTGAGTCCGATGCTGTTCAACCACTAATAGGTAAGAAATCATGAGTCAAGTTACTGAACAATCCGTACGTTTCCAGACCGCTTTGGCCTCTATTAAGCTCATTCAGG
CTTCTGCCGTTTTGGATTTAACCGAAGATGATTTCGATTTTCTGACGAGTAACAAAGTTTGGATTGCTACTGACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATG
GTACGCTGGACTTTGTGGGATACCCTCGCTTTCCTGCTCCTGTTGAGTTTATTGCTGCCGTCATTGCTTATTATGTTCATCCCGTCAACATTCAAACGGCCTGTCTCATCATGG
AAGGCGCTGAATTTACGGAAAACATTATTAATGGCGTCGAGCGTCCGGTTAAAGCCGCTGAATTGTTCGCGTTTACCTTGCGTGTACGCGCAGGAAACACTGACGTTCTTACT
GACGCAGAAGAAAACGTGCGTCAAAAATTACGTGCGGAAGGAGTGATGTAATGTCTAAAGGTAAAAAACGTTCTGGCGCTCGCCCTGGTCGTCCGCAGCCGTTGCGAGGTA
CTAAAGGCAAGCGTAAAGGCGCTCGTCTTTGGTATGTAGGTGGTCAACAATTTTAATTGCAGGGGCTTCGGCCCCTTACTTGAGGATAAATTATGTCTAATATTCAAACTGGC
GCCGAGCGTATGCCGCATGACCTTTCCCATCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGACTCCTTCGAGATGGA
CGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCTATTGACTCTACTGTAGACATTTTTACTTTTTATGTCCCTCATCGTCACGTTTATGGTGAACAGTGGAT
TAAGTTCATGAAGGATGGTGTTAATGCCACTCCTCTCCCGACTGTTAACACTACTGGTTATATTGACCATGCCGCTTTTCTTGGCACGATTAACCCTGATACCAATAAAATCCC
TAAGCATTTGTTTCAGGGTTATTTGAATATCTATAACAACTATTTTAAAGCGCCGTGGATGCCTGACCGTACCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGCTCGTTAT
GGTTTCCGTTGCTGCCATCTCAAAAACATTTGGACTGCTCCGCTTCCTCCTGAGACTGAGCTTTCTCGCCAAATGACGACTTCTACCACATCTATTGACATTATGGGTCTGCAA
GCTGCTTATGCTAATTTGCATACTGACCAAGAACGTGATTACTTCATGCAGCGTTACCATGATGTTATTTCTTCATTTGGAGGTAAAACCTCTTATGACGCTGACAACCGTCCTT
TACTTGTCATGCGCTCTAATCTCTGGGCATCTGGCTATGATGTTGATGGAACTGACCAAACGTCGTTAGGCCAGTTTTCTGGTCGTGTTCAACAGACCTATAAACATTCTGTGC
CGCGTTTCTTTGTTCCTGAGCATGGCACTATGTTTACTCTTGCGCTTGTTCGTTTTCCGCCTACTGCGACTAAAGAGATTCAGTACCTTAACGCTAAAGGTGCTTTGACTTATA
CCGATATTGCTGGCGACCCTGTTTTGTATGGCAACTTGCCGCCGCGTGAAATTTCTATGAAGGATGTTTTCCGTTCTGGTGATTCGTCTAAGAAGTTTAAGATTGCTGAGGGT
CAGTGGTATCGTTATGCGCCTTCGTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTTCCCATTCATTCAGGAACCGCCTTCTGGTGATTTGCAAGAACGCGTACTTATTCGC
CACCATGATTATGACCAGTGTTTCCAGTCCGTTCAGTTGTTGCAGTGGAATAGTCAGGTTAAATTTAATGTGACCGTTTATCGCAATCTGCCGACCACTCGCGATTCAATCATG
ACTTCGTGATAAAAGATTGAGTGTGAGGTTATAACGCCGAAGCGGTAAAAATTTTAATTTTTGCCGCTGAGGGGTTGACCAAGCGAAGCGCGGTAGGTTTTCTGCTTAGGAGT
TTAATCATGTTTCAGACTTTTATTTCTCGCCATAATTCAAACTTTTTTTCTGATAAGCTGGTTCTCACTTCTGTTACTCCAGCTTCTTCGGCACCTGTTTTACAGACACCTAAAGC
TACATCGTCAACGTTATATTTTGATAGTTTGACGGTTAATGCTGGTAATGGTGGTTTTCTTCATTGCATTCAGATGGATACATCTGTCAACGCCGCTAATCAGGTTGTTTCTGTT
GGTGCTGATATTGCTTTTGATGCCGACCCTAAATTTTTTGCCTGTTTGGTTCGCTTTGAGTCTTCTTCGGTTCCGACTACCCTCCCGACTGCCTATGATGTTTATCCTTTGAATG
GTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTATTGACGTCCTTCCCCGTACGCCGGGCAATAACGTTTATGTTGGTTTCATGGTTTGGTCTAACTTTACC
GCTACTAAATGCCGCGGATTGGTTTCGCTGAATCAGGTTATTAAAGAGATTATTTGTCTCCAGCCACTTAAGTGAGGTGATTTATGTTTGGTGCTATTGCTGGCGGTATTGCTT
CTGCTCTTGCTGGTGGCGCCATGTCTAAATTGTTTGGAGGCGGTCAAAAAGCCGCCTCCGGTGGCATTCAAGGTGATGTGCTTGCTACCGATAACAATACTGTAGGCATGGG
TGATGCTGGTATTAAATCTGCCATTCAAGGCTCTAATGTTCCTAACCCTGATGAGGCCGCCCCTAGTTTTGTTTCTGGTGCTATGGCTAAAGCTGGTAAAGGACTTCTTGAAGG
TACGTTGCAGGCTGGCACTTCTGCCGTTTCTGATAAGTTGCTTGATTTGGTTGGACTTGGTGGCAAGTCTGCCGCTGATAAAGGAAAGGATACTCGTGATTATCTTGCTGCTG
CATTTCCTGAGCTTAATGCTTGGGAGCGTGCTGGTGCTGATGCTTCCTCTGCTGGTATGGTTGACGCCGGATTTGAGAATCAAAAAGAGCTTACTAAAATGCAACTGGACAAT
CAGAAAGAGATTGCCGAGATGCAAAATGAGACTCAAAAAGAGATTGCTGGCATTCAGTCGGCGACTTCACGCCAGAATACGAAAGACCAGGTATATGCACAAAATGAGATGC
TTGCTTATCAACAGAAGGAGTCTACTGCTCGCGTTGCGTCTATTATGGAAAACACCAATCTTTCCAAGCAACAGCAGGTTTCCGAGATTATGCGCCAAATGCTTACTCAAGCTC
AAACGGCTGGTCAGTATTTTACCAATGACCAAATCAAAGAAATGACTCGCAAGGTTAGTGCTGAGGTTGACTTAGTTCATCAGCAAACGCAGAATCAGCGGTATGGCTCTTCT
CATATTGGCGCTACTGCAAAGGATATTTCTAATGTCGTCACTGATGCTGCTTCTGGTGTGGTTGATATTTTTCATGGTATTGATAAAGCTGTTGCCGATACTTGGAACAATTTCT
GGAAAGACGGTAAAGCTGATGGTATTGGCTCTAATTTGTCTAGGAAATAACCGTCAGGATTGACACCCTCCCAATTGTATGTTTTCATGCCTCCAAATCTTGGAGGCTTTTTTA
TGGTTCGTTCTTATTACCCTTCTGAATGTCACGCTGATTATTTTGACTTTGAGCGTATCGAGGCTCTTAAACCTGCTATTGAGGCTTGTGGCATTTCTACTCTTTCTCAATCCCC
AATGCTTGGCTTCCATAAGCAGATGGATAACCGCATCAAGCTCTTGGAAGAGATTCTGTCTTTTCGTATGCAGGGCGTTGAGTTCGATAATGGTGATATGTATGTTGACGGCC
ATAAGGCTGCTTCTGACGTTCGTGATGAGTTTGTATCTGTTACTGAGAAGTTAATGGATGAATTGGCACAATGCTACAATGTGCTCCCCCAACTTGATATTAATAACACTATAGA
CCACCGCCCCGAAGGGGACGAAAAATGGTTTTTAGAGAACGAGAAGACGGTTACGCAGTTTTGCCGCAAGCTGGCTGCTGAACGCCCTCTTAAGGATATTCGCGATGAGTAT
AATTACCCCAAAAAGAAAGGTATTAAGGATGAGTGTTCAAGATTGCTGGAGGCCTCCACTATGAAATCGCGTAGAGGCTTTGCTATTCAGCGTTTGATGAATGCAATGCGACA
GGCTCATGCTGATGGTTGGTTTATCGTTTTTGACACTCTCACGTTGGCTGACGACCGATTAGAGGCGTTTTATGATAATCCCAATGCTTTGCGTGACTATTTTCGTGATATTGG
TCGTATGGTTCTTGCTGCCGAGGGTCGCAAGGCTAATGATTCACACGCCGACTGCTATCAGTATTTTTGTGTGCCTGAGTATGGTACAGCTAATGGCCGTCTTCATTTCCATG
CGGTGCACTTTATGCGGACACTTCCTACAGGTAGCGTTGACCCTAATTTTGGTCGTCGGGTACGCAATCGCCGCCAGTTAAATAGCTTGCAAAATACGTGGCCTTATGGTTAC
AGTATGCCCATCGCAGTTCGCTACACGCAGGACGCTTTTTCACGTTCTGGTTGGTTGTGGCCTGTTGATGCTAAAGGTGAGCCGCTTAAAGCTACCAGTTATATGGCTGTTGG
TTTCTATGTGGCTAAATACGTTAACAAAAAGTCAGATATGGACCTTGCTGCTAAAGGTCTAGGAGCTAAAGAATGGAACAACTCACTAAAAACCAAGCTGTCGCTACTTCCCAA
GAAGCTGTTCAGAATCAGAATGAGCCGCAACTTCGGGATGAAAATGCTCACAATGACAAATCTGTCCACGGAGTGCTTAATCCAACTTACCAAGCTGGGTTACGACGCGACG
CCGTTCAACCAGATATTGAAGCAGAACGCAAAAAGAGAGATGAGATTGAGGCTGGGAAAAGTTACTGTAGCCGACGTTTTGGCGGCGCAACCTGTGACGACAAATCTGCTCA
AATTTATGCGCGCTTCGATAAAAATGATTGGCGTATCCAACCTGCA
Genome Sizes
• Smallest number of genes: Mycoplasma genitalium, 470 genes
• Human genome: ~120,000 genes (as was mistakenly believed!)
• Bacteriophage φX174
ORDER OF MAGNITUDE OF GENOME SIZES (base pairs = bp)
Viruses       | 10 kbp (SV40 5 k, T2 48.6 k ...)
Bacteria      | 4 Mbp (E. coli 4.7 Mb)
Yeast         | 9 Mbp
Nematode      | 90 Mbp
Insects       | 0.2 - 7.5 Gbp
Fruit fly     | 180 Mbp
Mammals       | 1.4 - 5.7 Gbp (man 3.2 Gbp)
Lungfish      | 140 Gbp
Mustard weed  | 200 Mbp
Pine          | 68 Gbp
Amoeba dubia  | 670 Gbp
THE C-VALUE PARADOX
• C value = amount of DNA in the haploid genome of an organism
• Many less complex organisms have surprisingly high C values.
• Does the "extra" DNA have a function?
If not, why is it preserved from generation to generation?
Gene               | Disease            | Length
Human β-globin     | sickle-cell anemia | 2,000 bp
Human factor VIII  | hemophilia         | 200,000 bp
Protein kinase     | muscular dystrophy | 3,407 bp
Given that the identity of living things is supplied by the genetic substrate, the hypothesis “species are sparse” (Battail) seems valid.
• Number of living species on Earth ~ 10^7
Assume these are a fraction of 1/100 of all the species that ever existed (extinction).
That gives ~10^9 species (apparently a large number...)
This is ridiculously small compared with the total number of possible genomes in the absence of redundancy:
GENOMES ~ 4^(10^9) ~ 10^(6×10^8)
(for a typical genome of 10^9 nucleotides)
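The size of the genome space is far too large to evaluate directly, but its base-10 exponent is easy to compute, since 4^n = 10^(n·log10 4). A short sketch with illustrative variable names:

```python
import math

genome_length = 10**9   # nucleotides in a typical genome (slide's figure)
species_count = 10**9   # rough number of species that ever existed

# 4**genome_length == 10**exponent, with exponent = n * log10(4).
exponent = genome_length * math.log10(4)

# log10 of the fraction of possible genomes actually used by species.
log10_fraction_used = math.log10(species_count) - exponent
```

The exponent is about 6×10^8, so the 10^9 species occupy a vanishingly small fraction of the possible genomes, which is the sense in which species are sparse.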
A Short Chronology of Genomes
• 1977 Complete sequencing of the genome of phage φX174 (5,386 bp)
• 1995 First living organism: genome of Haemophilus influenzae (1.8 Mbp)
• 1996 Saccharomyces cerevisiae (12.1 Mbp)
• 1997 Escherichia coli (4.6 Mbp)
• 1998 First animal, a nematode: genome of Caenorhabditis elegans (97.1 Mbp)
• 1999 First human chromosome: chromosome 22 (33.4 Mbp)
• 2000 Drosophila melanogaster (120 Mbp)
• 2000 Chromosomes 5, 16, 19, 21
• 1988-2000 Human Genome Project
• June 2000: milestone draft sequence
"It's all in the genes"... Or not!
• For a long time, genetics was reduced to this paradigm. Indeed, after the discovery of the structure of DNA, one scheme came to prevail:
• The structure of DNA is like a computer program in which the gene, by coding for proteins, determines the appearance of living organisms and governs most of their behaviour.
Reductionism:
A warning from Andras Paldi (CNRS).
• The tremendous reductionism of genetics researchers ends up treating the living being as a strict sum of juxtaposed elements.
• By compiling a catalogue of proteins we run the risk of making the problem worse.
It is as if we tried to understand how a rocket works by reading its parts catalogue!
Of Protein Size and Genomes
NEREIDE S. SANTOS-MAGALHÃES, HÉLIO M. DE OLIVEIRA
WSEAS TRANS. ON BIOLOGY AND BIOMEDICINE
Issue 2, Vol.3, February 2006 ISSN: 1109-9518
~250 academia downloads
Number of genes (in living organisms)
1) Bacterial genomes: number of genes ≈ genome size in kbp. Bacterial proteins show about 350 amino acid residues as typical.
2) C. elegans: genome of 99 Mbp and genomic rate of 25%. Its protein size distribution has an average polypeptide length of 469 amino acids.
• Human proteins: serum albumin has 609 amino acid residues, collagen about 1,000, apolipoprotein B 4,536, human titin 26,926.
A DNA code is specified by the triplet DNA(C, R, d), where
C is the genome size (bp),
R is the genomic rate,
d is the coding density (bp/gene).
$R = \dfrac{\text{number of protein-coding base pairs}}{\text{total number } C \text{ of base pairs of the genome}}$
$g \approx C / d$
Further DNA parameters:
g is the number of genes of the genome,
e is the average number of exons per gene.
Coding density (bp/gene): estimated in terms of the expected protein size.
• average bacterial protein ~300 amino acids long,
• bacterial genomic rate ~0.8 to 0.9.
Bacteria usually have a coding density d ≈ 1,000 bp/gene.
Number of genes for bacteria: g ≈ C/1,000
(this is strikingly confirmed at
http://www.cbs.dtu.dk/services/GenomeAtlas/
http://www.cbs.dtu.dk/services/GenomeAtlas-2.0/show-databas
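The rule of thumb g ≈ C/1,000 can be tried against a few bacterial genomes. The genome sizes and gene counts below are the ones quoted in this presentation's table of DNA parameters; the dictionary and function names are illustrative.

```python
# (genome size C in Mbp, reported number of genes g), from the presentation.
BACTERIA = {
    "H. pylori": (1.67, 1566),
    "H. influenzae": (1.83, 1709),
    "B. subtilis": (4.21, 4106),
    "E. coli": (4.64, 4289),
}

CODING_DENSITY = 1000  # bp/gene, rule of thumb for bacteria


def estimate_genes(c_mbp: float, d: float = CODING_DENSITY) -> int:
    """Estimate the number of genes as g = C / d."""
    return round(c_mbp * 1e6 / d)


# Relative error of the estimate for each organism.
relative_error = {
    name: abs(estimate_genes(c) - g) / g for name, (c, g) in BACTERIA.items()
}
```

For these four genomes the estimate stays within roughly ten percent of the reported gene count.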
[Figure: protein size histograms for simple organisms, the φX174 and λ phage viruses; x-axis: protein length (aa residues), y-axis: # genes (%)]
[Figure: protein size histogram for C. elegans; x-axis: protein length (aa residues), y-axis: # genes (%)]
The coding density of different chromosomes of lower eukaryotic species is roughly the same, i.e. there are only slight fluctuations from one chromosome to another in the same organism.
S. cerevisiae (C = 12,057,849 bp, g = 6,268 genes) has an average coding density of 1,947 bp/gene over 15 chromosomes.
S. cerevisiae: coding density per chromosome (bp/gene)
Chr1 | 2,093 | Chr9  | 1,864
Chr2 | 1,918 | Chr10 | 1,906
Chr3 | 1,855 | Chr11 | 1,960
Chr4 | 1,870 | Chr12 | 1,989
Chr5 | 2,090 | Chr13 | 1,841
Chr6 | 2,144 | Chr14 | 1,854
Chr7 | 1,891 | Chr15 | 1,908
Chr8 | 2,017 | average | 1,947 bp/gene
(from http://www.cbs.dtu.dk/services/GenomeAtlas)
The coefficient of variation (CV %) of the coding density is 5.06 %
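The average and the coefficient of variation quoted for S. cerevisiae can be recomputed from the per-chromosome values. The sketch assumes the CV is defined as the sample standard deviation divided by the mean; the variable names are illustrative.

```python
import statistics

# Coding density (bp/gene) of S. cerevisiae chromosomes 1 to 15.
DENSITY = [2093, 1918, 1855, 1870, 2090, 2144, 1891, 2017,
           1864, 1906, 1960, 1989, 1841, 1854, 1908]

mean_density = statistics.mean(DENSITY)
# Coefficient of variation in percent, using the sample standard deviation.
cv_percent = 100 * statistics.stdev(DENSITY) / mean_density
```

With these definitions the mean rounds to 1,947 bp/gene and the CV to about 5.06%.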
The six chromosomes of C. elegans (C = 98,971,533 bp, g = 17,585 genes) present an average coding density of 5,731 bp/gene.
C. elegans: coding density per chromosome (bp/gene)
ChrI    | 5,072
ChrII   | 5,592
ChrIII  | 5,771
ChrIV   | 6,312
ChrV    | 4,899
Chr X   | 6,740
average | 5,731 bp/gene
(from http://www.cbs.dtu.dk/services/GenomeAtlas)
The coding density barely varies from one chromosome to another
The coefficient of variation (CV %) of the coding density is 1.72 %
DNA parameters for some well-known genomes
• virus φX174
• microbial M. genitalium
• H. pylori
• H. influenzae
• S. aureus
• B. subtilis
• M. tuberculosis
• E. coli
• X. fastidiosa

Organism          | genome size C (Mbp) | coding density d (bp/gene) | number of genes g | genomic rate R | average protein length (aa) | information (Mbits) | genomic redundancy 1-R (%)
φX174             | 0.0054              | 538                        | 10                | 1.00           | 180                         | 0.01                | ~0
λ bacteriophage   | 0.0485              | 683                        | 71                | 0.95           | 216                         | 0.09                | 5
M. genitalium     | 0.58                | 1,208                      | 480               | 0.90           | 363                         | 1.04                | 10
H. pylori         | 1.67                | 1,066                      | 1,566             | 0.89           | 316                         | 2.97                | 11
H. influenzae     | 1.83                | 1,071                      | 1,709             | 0.86           | 307                         | 3.15                | 14
S. aureus         | 2.80                | 1,069                      | 2,619             | 0.84           | 299                         | 4.70                | 16
B. subtilis       | 4.21                | 1,025                      | 4,106             | 0.87           | 297                         | 7.32                | 13
M. tuberculosis   | 4.41                | 1,126                      | 3,918             | 0.97           | 364                         | 8.56                | 3
E. coli           | 4.64                | 1,082                      | 4,289             | 0.87           | 314                         | 8.08                | 13
X. fastidiosa     | 2.52                | 1,238                      | 2,034             | 0.78           | 322                         | 3.93                | 22
S. cerevisiae     | 12.06               | 1,924                      | 6,268             | 0.70           | 450                         | 17.3                | 30
C. elegans        | 99                  | 5,628                      | 17,585            | 0.25           | 469                         | 49.5                | 75
D. melanogaster   | 180 (~60*; C' = 120)        | d ~ 13,235; d' ~ 8,823   | 13,600   | 0.13   | 573   | 46.8    | 87
Human (old)       | ~3,000 (1,000*; C' = 2,000) | d ~ 30,000; d' ~ 20,000  | 100,000? | ~0.03  | ~300? | ~180.0? | ~97?
Human (update)    | ~2,900 (967*; C' = 1,933)   | d ~ 112,500; d' ~ 75,000 | ~25,800  | ~0.016 | ~600  | ~92.9   | ~98.4
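Two columns of this table appear to follow from the others. The relations used below are assumptions inferred from the numbers, not formulas stated in the slides: information ≈ 2·C·R (two bits per coding base) and average protein length ≈ d·R/3 (three bases per residue). The function names are illustrative.

```python
# (C in Mbp, d in bp/gene, R) for a few rows of the table.
ROWS = {
    "H. influenzae": (1.83, 1071, 0.86),
    "E. coli": (4.64, 1082, 0.87),
    "D. melanogaster": (180.0, 13235, 0.13),
}


def information_mbits(c_mbp: float, r: float) -> float:
    """Assumed relation: 2 bits per base over the coding fraction R of C."""
    return 2 * c_mbp * r


def protein_length(d: float, r: float) -> float:
    """Assumed relation: coding bp per gene (d * R) divided by 3 bp/residue."""
    return d * r / 3


derived = {
    name: (round(information_mbits(c, r), 2), round(protein_length(d, r)))
    for name, (c, d, r) in ROWS.items()
}
```

For these rows the derived values land close to the tabulated ones (for example about 8.07 Mbits and 314 residues for E. coli), which supports the reading of the columns but does not prove it.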
1) An unsuccessful attempt to explain the complexity of living beings:
• the genome length.
The so-called C-value paradox showed that this is incorrect.
2) The number of genes was supposed to be related to complexity.
• This led people to expect more genes than humans actually have.
• A figure of about 100,000 was widespread in the 80's and late 90's.
3) A potential measure that correlates with complexity:
• the average protein size.
Storing all the genes of a single human requires less than 10 MB
(although the entire human DNA sequence requires about 1 GB).
Let C' and d' denote the genome size and the coding density when highly repetitive sequences are excluded.
About one third of higher eukaryotic DNA corresponds to these sequences, which are not transcribed but may have structural properties.
Therefore, C' = 2C/3 and d' = 2d/3.
The prime refers to the expurgated genome, i.e. with highly repeated sequences set apart.
Expected gene distribution in the 23 human chromosomes
chromosome | length (bp)  | predicted genes (unveiled genes)
Chr1       | 226,828,929  | 2,016
Chr2       | 205,000,000  | 1,822 (1,346)
Chr3       | 195,073,306  | 1,734
Chr4       | 115,000,000  | 1,022 (796)
Chr5       | 117,696,509  | 1,046 (923)
Chr6       | 169,212,327  | 1,504 (1,557)
Chr7       | 310,210,944  | 1,367a (1,150)
Chr8       | 143,297,300  | 1,274
Chr9       | 117,790,386  | 1,047 (1,149)
Chr10      | 132,016,990  | 1,173 (816)
Chr11      | 130,908,954  | 1,163
Chr12      | 129,826,379  | 1,154
Chr13      | 90,000,000   | 800 (633)
Chr14      | 87,191,216   | 775 (1,050)
Chr15      | 81,992,482   | 729
Chr16      | 79,932,432   | 711 (880)
Chr17      | 79,376,966   | 705
Chr18      | 74,658,403   | 663
Chr19      | 55,878,340   | 497b (1,461)
Chr20      | 59,424,990   | 528 (727)
Chr21      | 33,924,367   | 301c (225)
Chr22      | 34,352,072   | 305 (545)
Chr X      | 152,118,949  | 1,352 (1,098)
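The predicted gene counts in this table are consistent with dividing each chromosome length by a single coding density. The sketch below assumes d = 112,500 bp/gene, the value listed for "Human (update)" in the table of DNA parameters; that this is how the predictions were produced is an inference from the numbers, and the names are illustrative.

```python
CODING_DENSITY = 112_500  # bp/gene, assumed from the "Human (update)" row

# Chromosome lengths in bp for a few rows of the table.
LENGTH_BP = {
    "Chr1": 226_828_929,
    "Chr4": 115_000_000,
    "Chr19": 55_878_340,
    "Chr22": 34_352_072,
}

# Predicted number of genes per chromosome: g = C / d, rounded.
predicted_genes = {
    name: round(length / CODING_DENSITY) for name, length in LENGTH_BP.items()
}
```

For these rows the result matches the tabulated predictions; rows carrying a footnote mark (a, b, c) may have been adjusted differently.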
Gene distribution in human chromosomes:
• Genome size C = 2,881 Mbp;
• Number of genes g = 22,525.
[Figure: human karyogram]
Mean gene size in each chromosome:
Chrom. [ref.] | C (bp)      | genes & pseudogenes (genes only) | mean exon length (bp) | mean intron length (bp) | e (exons/gene) | mean gene size (kbp)
Chr2 [27]     | 237,000,000 | 2,585 (1,346)                    | --                    | --                      | 5.30           | 33.8
Chr4 [27]     | 186,000,000 | 1,574 (796)                      | --                    | --                      | 6.60           | 34.3
Chr6 [28]     | 166,800,000 | 2,190 (1,557)                    | 318                   | 7,208                   | 5.28           | 32.5
Chr9 [29]     | 109,044,351 | 1,575 (1,149)                    | 342                   | 6,799                   | 5.77a          | 34.4
Chr10 [30]    | 131,666,441 | 1,357 (816)                      | 322                   | 7,817                   | 5.84           | 39.7
Chr13 [31]    | 95,500,000  | 929 (633)                        | 320                   | 9,164                   | 5.20           | 40.2
Chr14 [32]    | 87,410,661  | 1,443 (1,050)                    | 295                   | 8,194                   | 6.35a          | 45.7
Chr20 [33]    | 59,187,298  | 895 (727)                        | 292                   | 5,170                   | 6.00           | 27.2
Chr22 [34]    | 34,491,000  | 679 (545)                        | 266                   | 4,037                   | 5.40           | 19.2
Human chromosomes: average lengths
The average number of amino acid residues ($\bar{L}$) and the genomic rate (R) are shown.
Chrom. | average number of amino acid residues $\bar{L}$ (aa) | genomic rate R (%)
Chr6   | 560 | 1.56
Chr9   | 658 | 1.79
Chr10  | 627 | 1.17
Chr13  | 555 | 1.10
Chr14  | 624 | 2.36
Chr20  | 584 | 2.15
Chr22  | 479 | 1.82
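The per-chromosome figures fit together under a simple gene model with e exons and e - 1 introns per gene. The two relations below are assumptions inferred from the tabulated numbers, not formulas recovered from the slides: mean gene size ≈ e·E + (e - 1)·I, and average protein length ≈ e·E/3, where E and I are the mean exon and intron lengths. Names are illustrative.

```python
# (mean exon length E in bp, mean intron length I in bp, exons per gene e)
CHROMOSOMES = {
    "Chr6": (318, 7208, 5.28),
    "Chr13": (320, 9164, 5.20),
    "Chr22": (266, 4037, 5.40),
}


def gene_size_kbp(exon_bp: float, intron_bp: float, exons: float) -> float:
    """Assumed model: e exons separated by e - 1 introns."""
    return (exons * exon_bp + (exons - 1) * intron_bp) / 1000


def protein_residues(exon_bp: float, exons: float) -> float:
    """Assumed model: coding length e * E divided by 3 bp per residue."""
    return exons * exon_bp / 3


derived = {
    name: (round(gene_size_kbp(ex, it, e), 1), round(protein_residues(ex, e)))
    for name, (ex, it, e) in CHROMOSOMES.items()
}
```

For Chr6 and Chr22 this gives mean gene sizes of about 32.5 and 19.2 kbp and protein lengths of about 560 and 479 residues, in line with the tables.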
CONCLUSIONS
• average exon length of about 300 bp,
• average intron length of about 6,900 bp,
• mean of about 6 exons/gene
• (from single-exon genes to 175 exons for the titin gene!)
• average number of residues for coded proteins ~ 600 aa.
************
Average protein size is a worthy criterion for assessing the complexity of life.
DNA-Error Control Code May Be Unstructured
H. M. DE OLIVEIRA, N.S. SANTOS-MAGALHÃES
The astonishing reliability with which deoxyribonucleic acid (DNA) has been preserved through the ages implies that the cell's replication machinery has to guard against copying mistakes.
The replication machinery is self-correcting and operates with a mean of 1 error per 10^7 nucleotides copied. Around 99% of such errors are corrected by the DNA mismatch repair mechanism, resulting in 1 error per 10^9 nucleotides copied.
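The two-stage error figure can be reproduced directly. The last line goes one step beyond the slide and is an illustration only: it applies the residual rate to a genome of 3×10^9 bp, the human genome size used earlier in the presentation. Variable names are illustrative.

```python
raw_error_rate = 1e-7    # errors per nucleotide copied, before repair
repair_fraction = 0.99   # share of errors fixed by mismatch repair

# Errors that escape repair, per nucleotide copied.
residual_error_rate = raw_error_rate * (1 - repair_fraction)

# Illustration: expected uncorrected errors in one copy of a 3e9 bp genome.
expected_errors_per_copy = residual_error_rate * 3e9
```

The residual rate comes out at 10^-9 per nucleotide, i.e. roughly three uncorrected errors per copy of a genome of that size.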
Introns & exons
Most eukaryotic genes have their coding sequences interrupted by noncoding regions (the so-called introns, for intervening sequences).
Introns are usually longer than exons.
INTRONS: size ranging from 20 bp to 250,000 bp;
EXONS: size ranging from 50 to 600 bp (average 300 bp).
Attempts to understand the biological role of introns: no recognized functions were found.
Highly repetitive sequences:
SINES (short interspersed elements): 13% of the genome,
LINES (long interspersed elements): 21% of the genome.
Repetitive DNA has commonly been regarded as "junk DNA".
Noncoding DNA: introns, 26% of the human genome.
• Viruses and bacteria have high fecundity and few gene families; they have little or almost no need for protection.
• Plants and animals have high permanency.
=> They must be robust to mutations (survivors of natural selection).
Standard error-correcting codes are designed by imposing constraints on the sequences.
Why use structured codes? Answer: the (mistaken) belief that decoding a random code is unfeasible, since the lack of structure would require an exhaustive search.
We think that Darwinian mechanisms for protecting DNA may be quite different.
No parity rules should be looked for! (HMdO)
We believe that introns were the spontaneous mechanism for introducing uncertainty.
• In a battle, a crucial payload is to be sent to the front. If the only way is to send it through the battlefield, it should not be dispatched directly. Many fake cargos can be added, and the relevant one is hidden among them.
If the enemy (noise, mutation) tries hard to intercept this crucial delivery, he will now probably not succeed, because of the amount of uncertainty added to the process. Many ineffective cargos (junk cargos, or introns) will be hit, but the main one will probably be missed.
• The same strategy is used to safeguard authorities such as the Presidents of some nations (uncertain routes and doubles).
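The decoy argument can be given a toy numerical form. The model is an assumption made here for illustration, not one stated by the authors: each mutation strikes a uniformly random position, independently of the others, and it damages the payload only if it falls in the coding fraction R of the genome. The function name is hypothetical; the R values are those quoted in the presentation for E. coli and for the updated human genome.

```python
def p_payload_untouched(coding_fraction: float, hits: int) -> float:
    """Probability that none of `hits` independent, uniformly placed
    mutations falls inside the coding fraction of the genome (toy model)."""
    return (1.0 - coding_fraction) ** hits


R_E_COLI = 0.87    # genomic rate quoted for E. coli
R_HUMAN = 0.016    # genomic rate quoted for the updated human genome

p_bacterium = p_payload_untouched(R_E_COLI, hits=10)
p_human = p_payload_untouched(R_HUMAN, hits=10)
```

Under this toy model ten random hits almost certainly damage a compact bacterial genome, while a genome that is mostly non-coding usually escapes, which is the intuition behind the battlefield analogy.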
DNA coding has a trivial decoding scheme (asynchronous start-stop protocol).
• The DNA code meets Battail's close-to-random criterion.
• Biological evolutionary codes match Shannon's paradigm: they are long, truly random codes.
We quote Battail:
“Nature appears as an outstanding engineer…”
CLOSING REMARK:
This seminar is essentially a provocation!
If Statistics deals with large masses of (already available) data with inherently random behaviour, then the publicly available genome databases are a source of challenges for excellent work and discoveries.
Thank you...
http://www2.ee.ufpe.br/codec/deOliveira.html
Ribonucleic Acids - Types