Gibson Second Edition


Genome Science

Ka-Lok Ng Dept. of Bioinformatics Asia University

The Core Aims of Genomics Science

(1) An integrated web-based database and research interface

• Access to the enormous volume of data
• Web interfaces
• Relational databases
• Generic Model Organism Database (GMOD) project, http://www.gmod.org/: develops reusable components suitable for creating new community databases of biology

The Core Aims of Genomics Science

(2) To assemble physical and genetic maps

• Location of genes in a genome: physical distance, and relative position defined by recombination frequencies
• The map is crucial for comparing the genomes of related species
• Relates phenotypic and genetic data, as used in animal and plant breeding
• Extend to more species with greater accuracy

The Core Aims of Genomics Science

(3) To generate and order genomic and expressed gene sequences

• High-volume sequencing: the basic technique was developed by Fred Sanger
• "Shotgun" approach: reads are assembled into contigs, then scaffolds (a set of contigs), then whole chromosomes
• mRNA is unstable, so the coding parts are captured as cDNA clones, cloned from mRNA transcripts
• Expressed sequence tags (ESTs)
• Obtaining full-length cDNA is not easy because of mRNA structure

The Core Aims of Genomics Science

(3) To generate and order genomic and expressed gene sequences

[Figure: mRNA → cDNA → EST]

• Whole genome reconstruction
• Reverse transcription of mRNA gives cDNA
• ESTs are partial cDNA sequences, sequenced from either the 5' or the 3' end
• Alternative splicing means there is not a one-to-one correspondence between ESTs and genes

The Core Aims of Genomics Science

(4) Identify and annotate within a genome the complete set of genes encoded

From the complete sequence of a genome → identification of genes

• Alignment of cDNA, DNA and protein sequences: BLAST
• Gene finding software: ORFs, transcription start and termination sites, exon/intron boundaries
• Then gene annotation: linking sequence to genetic function, expression, locus information, and comparative data from homologous proteins

The Core Aims of Genomics Science

(5) To characterize DNA sequence diversity

Single-nucleotide polymorphisms (SNPs)

• About 90 percent of human genome variation comes in the form of single nucleotide polymorphisms (neither harmful nor beneficial).
• Theoretically, a SNP could have four possible forms, or alleles (alternative sequences), since there are four types of bases in DNA. In reality, most SNPs have only two alleles. For example, if some people have a T at a certain place in their genome while everyone else has a G, that place in the genome is a SNP with a T allele and a G allele.
• The human genome contains more than 10 million SNPs, roughly one in every 100 to 300 bp.
• Find associations between SNP variation and phenotypic variation, e.g. sickle-cell anemia, which is caused by a SNP mutation.

Sickle-cell anemia and SNP
http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/R/RFLPs.html

The Core Aims of Genomics Science

(5) To characterize DNA sequence diversity

• Characterize the level of haplotype structure due to linkage disequilibrium (LD)
• Haplotype = a set of adjacent polymorphisms found on a single chromosome
• LD = groups of closely linked alleles that tend to be inherited together
• Knowledge of LD is used for disease locus mapping, and can map human disease genes very accurately
• In the human genome, haplotypes tend to be approximately 60,000 bp in size and therefore contain up to 60 SNPs that travel as a group

[Figure: Haplotype]
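As a rough illustration of how the strength of LD between two SNPs can be quantified, the sketch below computes the standard coefficients D and r² from haplotype counts. The function name, the haplotype labels ("AB", "Ab", "aB", "ab") and the counts are hypothetical and chosen only for this example; they do not come from the lecture.

```python
def linkage_disequilibrium(haplotype_counts):
    """Return (D, r_squared) for two biallelic loci.

    haplotype_counts maps the haplotypes "AB", "Ab", "aB", "ab" to the
    number of chromosomes on which each was observed (hypothetical data).
    """
    total = sum(haplotype_counts.values())
    freq = {h: n / total for h, n in haplotype_counts.items()}

    # Allele frequencies of A (first locus) and B (second locus).
    p_a = freq.get("AB", 0.0) + freq.get("Ab", 0.0)
    p_b = freq.get("AB", 0.0) + freq.get("aB", 0.0)

    # D measures how far the AB haplotype frequency departs from
    # the product expected if the two loci were independent.
    d = freq.get("AB", 0.0) - p_a * p_b

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, r_squared


# Hypothetical sample of 100 chromosomes with strong association.
counts = {"AB": 40, "Ab": 10, "aB": 10, "ab": 40}
d_value, r2_value = linkage_disequilibrium(counts)
print(f"D = {d_value:.3f}, r^2 = {r2_value:.3f}")
```

Values of r² near 1 indicate that the two SNPs almost always travel together, which is the situation a haplotype block describes; values near 0 indicate independent assortment of the alleles.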

The Core Aims of Genomics Science

Mendel's Laws enable the outcome of genetic crosses to be predicted (here, genes A and B are on different chromosomes).

The Core Aims of Genomics Science

Genes on the same chromosome should display linkage.

Genes A and B are on the same chromosome and so should be inherited together. Mendel's Second Law should therefore not apply to the inheritance of A and B, but holds for the inheritance of A and C, or B and C. Mendel did not discover linkage because the seven genes that he studied were each on a different pea chromosome.

Partial linkage

Partial linkage was discovered in the early 20th century. The cross shown here was carried out by Bateson, Saunders and Punnett in 1905 with sweet peas. The parental cross gives the typical dihybrid result (see the figure on the right), with all the F1 plants displaying the same phenotype, indicating that the dominant alleles are purple flowers and long pollen grains. The F1 cross gives unexpected results, as the progeny show neither a 9 : 3 : 3 : 1 ratio (expected for genes on different chromosomes) nor a 3 : 1 ratio (expected if the genes are completely linked). An unusual ratio is typical of partial linkage.

The Core Aims of Genomics Science

(5) To characterize DNA sequence diversity

The farther apart two genes are, the more they tend to assort independently (randomly), so the recombination frequency increases: a higher frequency means the genes are farther apart.

[Figure label: Vermilion]

The Core Aims of Genomics Science

(6) To compile atlases of gene expression

• Analyzing profiles of transcription and protein synthesis
• Traditional methods: Northern blots, hybridization
• Modern technology: microarrays
• Relative level of expression (differential expression)
• Patterns of covariation in gene expression give clues to unknown gene function (guilt by association)

The Core Aims of Genomics Science

(7) To accumulate functional data, including biochemical and phenotypic properties of genes

• Near-saturation mutagenesis (screening hundreds of thousands of mutants to identify genes that affect traits as diverse as embryogenesis, immunology, and behavior)
• High-throughput reverse genetics (methods to systematically and specifically inactivate individual genes)
  Yeast Genome Deletion Project http://www-sequence.stanford.edu/group/yeast_deletion_project/deletions3.html
  Mouse http://www.bioscience.org/knockout/knochome.htm
• Proteomics – detecting protein expression and protein-protein interactions
• Pharmacogenomics – the study of interactions between small molecules (i.e. potential drugs) and proteins
• Functional genomics – a crucial component is to study various model organisms
• Clone library – collections of DNA fragments that are cloned into a vector

The Core Aims of Genomics Science

With Smith's site-directed mutagenesis, researchers can study in detail how proteins function and how they interact with other biological molecules. Site-directed mutagenesis can be used, for example, to systematically change amino acids in enzymes, in order to better understand the function of these important biocatalysts. Researchers can also analyze how a protein is folded into its biologically active three-dimensional structure.

The method can also be used to study the complex cellular regulation of genes and to increase our understanding of the mechanisms behind genetic and infectious diseases, including cancer.

Site-directed mutagenesis example: GTC (valine) → GCC (alanine)
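The GTC → GCC example can be checked with a few lines of code. The sketch below builds the standard genetic code table, applies a single-base substitution to a short coding sequence, and translates both versions. The nine-base sequence and the helper names are made up for illustration only.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence into one-letter amino acids."""
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def point_mutate(dna, position, new_base):
    """Return a copy of dna with the base at a 0-based position replaced."""
    return dna[:position] + new_base + dna[position + 1:]


# Hypothetical coding sequence: ATG GTC AAA (Met-Val-Lys).
wild_type = "ATGGTCAAA"
# Change the middle base of the second codon: GTC -> GCC.
mutant = point_mutate(wild_type, 4, "C")

print(wild_type, translate(wild_type))
print(mutant, translate(mutant))
```

The single T → C substitution changes the second residue from valine (V) to alanine (A) while leaving the rest of the protein unchanged, which is the kind of targeted change the technique is designed to make.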

The Core Aims of Genomics Science

(8) To provide the resources for comparison with other genomes.

• Comparative maps allow genetic data from one species to be used in the other species
• Local gene order along a chromosome tends to be conserved: synteny (human and mouse genomes)
• Even without synteny, the conservation of gene function is known (say, from fly to primate)
• Gene order conservation (GOC)

Mapping Genomes – Genetic Maps

Genetic map – the relative order of genetic markers in linkage groups, in which the distance between markers is expressed as units of recombination

Genetic markers – sequence tags, repeats, restriction enzyme polymorphisms (cutting sites)

In diploid organisms, genetic maps are assembled from data on the co-segregation of genetic markers, either in pedigrees or in the progeny of controlled crosses.

• Genetic distance unit: centiMorgan (cM)
• In humans, 1 cM = 1% recombination frequency
• In humans, 1 cM ~ 1 Mbp
• 100 cM → 1 crossover occurs per chromosome per generation; 50 cM → 0.5 crossover occurs per generation
• Markers on different chromosomes have a 50-50 chance of co-segregation
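The relation between recombinant counts and map distance can be sketched as follows. This is a minimal illustration that assumes the simple rule stated above (1 cM = 1% recombination), which is only a good approximation for closely linked markers; the function names and the example counts are hypothetical.

```python
def recombination_frequency(recombinants, total_progeny):
    """Fraction of progeny that are recombinant for a pair of markers."""
    if total_progeny <= 0:
        raise ValueError("total_progeny must be positive")
    return recombinants / total_progeny


def map_distance_cm(rf):
    """Approximate map distance in cM, using 1 cM = 1% recombination.

    Observed recombination never exceeds 50% (independent assortment),
    so the value is capped there; markers at the cap behave as unlinked.
    """
    return min(rf, 0.5) * 100


# Hypothetical cross: 11 recombinants among 100 scored progeny.
rf = recombination_frequency(11, 100)
print(f"recombination frequency = {rf:.2f}, distance ~ {map_distance_cm(rf):.0f} cM")
```

For larger distances, multiple crossovers make the observed recombination frequency underestimate the true map distance, so mapping software applies a correction rather than this linear rule.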

Mapping Genomes – Genetic Maps

(A) A pair of different parental chromosomes (green and blue colors).

(B) A table showing the frequency of recombinants between each marker. Larger number indicates that the genes are farther apart.

(C) The most likely genetic map from the entire data. In this hypothetical example, two linkage groups are inferred, the top one is longer than 50 cM.

Genetic distance ~ 0.11 → 11 cM, 0.22 → 21 cM, 0.25 → 24 cM, 0.33 → 33 cM

Figure 1.1

Mapping Genomes – Genetic Maps

• Software for the assembly of genetic maps: http://linkage.rockefeller.edu/soft/list.html
• Multiple factors lead to high variation in the correspondence between physical and genetic distances
• There is variability of recombination rate along a chromosome (centromeres and telomeres are less recombinogenic than general euchromatin): hot spots and cold spots of recombination

Exercise 1.1 (Part 1) Constructing a genetic map

Constructing a genetic map - four recessive loci – thickskin, reddish, sour, petite.

After identifying two true-breeding trees that are either completely wild-type or mutant for all four loci, the breeder crosses them, and then plants an orchard of F2 (second generation) trees.

Q.

Based on the following frequencies of mutant classes,

determine which loci are likely to be on the same chromosome and which are the most closely linked.

Exercise 1.1 (Part 2) Constructing a genetic map

Four recessive loci – thickskin, reddish, sour, petite

Q. Determine which loci are likely to be on the same chromosome and which are the most closely linked.

Answer:

• Total number of trees: 968.
• Assume independent assortment: each recessive phenotype is expected in ¼ of the trees, i.e. 242. Observed: 242 petite (127+42+38+12+10+8+3+2), 249 reddish, 247 sour and 236 thickskin.
• Unlinked loci are expected to segregate independently, so each double-mutant class should contain ~60 trees (that is, 1/4 × 1/4 × 968).
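The expected counts used in the answer follow directly from the ¼ rule, as the short sketch below shows. The helper name is hypothetical; the total of 968 trees and the observed single-mutant counts are the ones quoted in the exercise.

```python
TOTAL_TREES = 968

# Observed numbers of trees showing each recessive phenotype (from the exercise).
OBSERVED_SINGLE = {"petite": 242, "reddish": 249, "sour": 247, "thickskin": 236}


def expected_count(total, n_loci):
    """Expected number of F2 trees recessive at n_loci unlinked loci.

    Each recessive phenotype appears with probability 1/4, and unlinked
    loci assort independently, so the probabilities multiply.
    """
    return total * 0.25 ** n_loci


expected_single = expected_count(TOTAL_TREES, 1)
expected_double = expected_count(TOTAL_TREES, 2)

for locus, observed in OBSERVED_SINGLE.items():
    print(f"{locus}: observed {observed}, expected {expected_single:.0f}")
print(f"expected per double-mutant class if unlinked: {expected_double:.1f}")
```

A double-mutant class that contains far more trees than this expectation suggests the two loci are linked, and the larger the excess, the closer the loci are likely to be.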

Exercise 1.1 (Part 2) Constructing a genetic map

[Figure: approximate solution, a map of the loci s, r, t and p]

Mapping Genomes – Physical Maps

Physical maps
• A physical map is an assembly of contiguous stretches of chromosomal DNA (contigs) in which the distance between landmark sequences of DNA is expressed in kilobases
• The ultimate physical map is the complete sequence

Applications
(1) provide a scaffold upon which polymorphic markers can be placed
(2) facilitate finer-scale linkage mapping
(3) confirm linkages inferred from recombination frequencies
(4) resolve ambiguities about the order of closely linked genes
(5) enable detailed comparisons of regions of synteny between genomes

Mapping Genomes – Physical Maps

Two strategies are used to assemble contigs.

(1) Alignment of randomly isolated clones based on shared restriction fragment length profiles
• YAC – ~1 Mbp long fragments
• BAC – ~100 kbp long fragments
• Plasmid – ~kbp long fragments
• Automatic restriction profiling (Ch. 2) → assemble contigs (short for "contiguous sequences")

Genomic clone library

Unlike the case of φX174, no large genome could be completely sequenced without an extra round of fragmentation into manageable-sized chunks. In other words, it had to be transferred into one or more clone libraries from which individual clones were picked to be "subcloned" in M13 for sequencing.

The general outline of the procedure is shown at right. You can see that φX174 bypassed the first stage, the construction of a clone library from the target genome.

cDNA library – made from RNA that has been reverse transcribed into cDNA; used for EST sequencing projects.

Cloning vectors

Mapping Genomes – Physical Maps

(2) Hybridization-based approaches – chromosome walking

Chromosome walking is used as a means of finding adjacent genes (positional cloning), or parts of a gene that are missing in the original clone, as well as to analyze long stretches of eukaryotic DNA. This task requires finding a set of overlapping fragments of DNA that spans the distance between the marker and the gene.

Genomic DNA is shown in blue. Selected clones from a library of cloned genomic DNA fragments are shown in red. The initial probe, probe a, is specific to gene A or exon A and allows identification of clones 1 and 2. A new probe, probe b, is prepared from one end of clone 2 and used to isolate new clones 3 and 4 from the genomic library. Probe c, prepared from clone 4, is used to identify clone 5, etc. The orientation of the clones is determined by restriction mapping of the clones. Clone 6 contains the desired gene B or exon B.
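The walking procedure can be mimicked with a toy simulation. In the sketch below each clone is represented by known start and end coordinates, which is an assumption made purely for illustration: in the laboratory these positions are unknown and overlaps are detected by hybridization. The clone names, coordinates and function name are hypothetical.

```python
def chromosome_walk(clones, probe_pos, target_pos):
    """Return the names of clones used to walk from a probe to a target.

    clones is a list of (name, start, end) tuples. At each step the probe
    "hybridizes" to every clone that covers its position; the clone that
    extends farthest is kept and a new probe is taken from its far end.
    """
    path = []
    while True:
        hits = [c for c in clones if c[1] <= probe_pos <= c[2]]
        if not hits:
            raise ValueError("walk stalled: no clone hybridizes to the probe")
        best = max(hits, key=lambda c: c[2])
        if path and best[0] == path[-1]:
            raise ValueError("walk stalled: no clone extends past the last one")
        path.append(best[0])
        if best[2] >= target_pos:
            return path
        probe_pos = best[2]  # next probe comes from the end of this clone


# Hypothetical library loosely following the figure description.
library = [
    ("clone1", 0, 40), ("clone2", 10, 60), ("clone3", 50, 100),
    ("clone4", 55, 120), ("clone5", 110, 170), ("clone6", 160, 220),
]
# Probe a sits at position 20 (gene A); gene B sits at position 200.
print(chromosome_walk(library, probe_pos=20, target_pos=200))
```

With this made-up library the walk proceeds through clones 2, 4, 5 and 6, mirroring the figure: each new probe is taken from the end of the clone that reaches farthest toward the target.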

Mapping Genomes – Cytogenetic Maps

• Historically, cytogenetic maps aided in the alignment of physical and genetic maps
• Cytogenetic maps are the banding patterns observed through a microscope on stained chromosome spreads
• Traditional preparations: salivary gland polytene chromosomes of insects (greatly enlarged relative to their usual condition) and Giemsa-banded mammalian metaphase karyotypes
  http://book.tngs.tn.edu.tw/database/scientieic/content/1970/00100010/images/0053b.jpg
• Chromosomes carry the genetic material, so phenotypes or medical conditions correlate with the deletion or rearrangement of chromosome sections
• Cytogenetic maps are aligned with the physical map through in situ hybridization: a clone fragment is annealed to a single location on the cytogenetic map
• NCBI Genomic Biology http://www.ncbi.nlm.nih.gov/Genomes/ (keyword: HOX AND homo[ORGN])

[Figure: Karyotypes]

Mapping Genomes – Cytogenetic Maps

Alignment of cytological, physical, and genetic maps.

• Cytological map – a representation of a chromosome based on the pattern of staining of bands
• Physical map – the location of transcripts and sites of insertions and deletions
• Genetic map – recombination rates vary along a chromosome, typically reduced near the telomere and centromere

Distances between genetic, physical and cytological markers are not uniform.

How to search for genes on a genome map? See my lecture notes for the Bioinformatics class.

Comparative Genomics

• Synteny – conservation of gene order between chromosome segments of two or more organisms.
• Homologs – highly conserved loci derived from a common ancestral locus
• Orthologs – similar genes that arose as a result of speciation
• Paralogs – similar genes that arose as a result of duplication subsequent to an evolutionary split
• Conservation of gene order is an inverse function of the time since divergence from the ancestral locus.
• Note – rates of divergence vary considerably at all taxonomic levels.
• Japanese pufferfish – its genome is 7.5 times smaller than the human genome but shows extensive gene order similarity with humans; around 50%-80% of genes are in the same order as in the human genome

Comparative Genomics

1. Chromosome painting – used to define regions of synteny; covers regions of ~0.1 of a chromosome arm.
2. Each chromosome of one species is labeled with a set of fluorescent dyes, and hybridized to chromosome spreads of the other genome.
3. Uses the fluorescence in situ hybridization (FISH) technique to detect DNA sequences in metaphase spreads of animal cells. The fluorescently labeled hybrid karyotype is shown at the bottom.

Comparative Genomics

Synteny between cat and human genomes.

Ideograms for each of the 24 human chromosomes, shown on the right in each pair, are aligned against color-coded representations of the corresponding cat chromosomes.

• Cat – six groups (A – F) of 2 – 4 chromosomes each.
• Top row – 12 autosomes that are essentially syntenic along their length, except for some rearrangements
• Bottom row – 10 autosomes that have at least one major rearrangement
• The two sex chromosomes are essentially syntenic between cat and human

Comparative Genomics

• Sequence conservation = functional importance
• High-resolution comparative physical mapping – found a ~1 Mbp synteny region between human and mouse
• May contain hundreds of genes
• Local inversions involving one or a few genes
• Families of genes organized in tandem clusters
• Considerable size variation in intergenic "junk" DNA, and insertions/deletions


Comparative Genomics

• Identifying genes and regulatory regions in sequenced genomes is challenging
• Open reading frames (ORFs) are usually a good indication of genes
• However, it is difficult to determine which ORFs belong to a gene
  – Many mammalian genes have small exons and large introns
• Regulatory sequences are even more difficult to identify

Comparative Genomics

• Computer programs analyze genomic sequence
  – GRAIL
  – GeneFinder
• Look for ORFs, splice sites, poly(A) addition sites, etc.
• Predict gene structure
• Frequently wrong
  – Usually miss exons at the beginning or end of a gene
  – Sometimes predict an exon when one doesn't really exist
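The first step such programs perform, scanning for ORFs, can be sketched in a few lines. This is a deliberately simplified illustration and not how GRAIL or GeneFinder work: it scans only the forward strand, ignores introns and splice sites, and uses a hypothetical function name and test sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end) pairs for forward-strand ORFs.

    An ORF here runs from an ATG to the first in-frame stop codon.
    Coordinates are 0-based, end-exclusive, and include the stop codon;
    min_codons counts codons including the start and the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens the reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


# Hypothetical sequence containing one short ORF: ATG AAA TTT TAG.
sequence = "CCATGAAATTTTAGGG"
for begin, end in find_orfs(sequence):
    print(begin, end, sequence[begin:end])
```

Because real mammalian genes are split into small exons separated by large introns, a scan like this finds only fragments of genes, which is one reason the predictions described above are frequently wrong.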

Comparative Genomics

• When comparing genomes of different species, the genes normally have the same exon–intron structure
• Look for conserved ORFs in both genomes
• This frequently permits accurate identification of genes
  – Fugu–human comparison found >1,000 genes
  – Mouse–human comparison indicates only 25,000 genes in the genome

Example of sequence comparison

• Comparison of the human and mouse spermidine synthase genes revealed an additional intron in the human gene that is not found in the mouse homologue

[Figure: human vs. mouse gene structure, 5,500 bp]

The Human Genome Project (HGP)

Objectives

1. Generation of high-resolution genetic and physical maps that will help in the localization of disease-associated genes.

2. The attainment of sequence benchmarks, leading to generation of a complete genome sequence by the year 2005. (A draft version was achieved in May 2000, but finished sequence required an error rate of less than 1 in 10,000 bp.)

3. Identification of each and every gene in the genome by a combination of bioinformatic identification of open reading frames (ORFs), generation of voluminous EST databases, and collation of functional data, including comparative data from other animal genome projects.

4. Compilation of exhaustive polymorphism databases, in particular of SNPs, to facilitate integration of genomic and clinical data, as well as studies of human diversity and evolution.

The Human Genome Project (HGP)

Table 1.1 Initial Goals of the HGP, from the First 5-Year Plan: 1993-1998

Table 1.2 A Blueprint for the Future of the HGP: 15 Grand Challenges in the Third 5-Year Plan: 2003 – 2005

HGP budget – part is set aside for research on the ethical, legal, and social implications of genetic research (the ELSI project)

The Human Genome Project

The architecture of the Human Genome Project in the twenty-first century: three major themes for future genome research are founded on six pillars of genome resources.

ELSI

Box 1.1 The Ethical, Legal, and Social Implications of the HGP

• Funding – The National Human Genome Research Institute (NHGRI) commits 5% of its annual budget to ELSI
• Funds three types of activities: regular research grants, education grants, and intramural programs at the NIH campus
• Web sites:
  http://www.genome.gov/10001618
  http://www.ornl.gov/sci/techresources/Human_Genome/research/elsi.html
• 4 major objectives
• 4 main subject areas

ELSI

• Of great concern is the privacy and confidentiality of genetic information, especially in Iceland (located between Greenland and Norway, http://www.tita.org.tw/view/iceland.html) and Estonia (the Republic of Estonia, http://www.suntravel.com.tw/zone/Europe/Estonia-136.htm), where government-sponsored databases of medical records have been supplied to medical research companies.
• Psychological impact and potential for stigmatization inherent in the generation of genetic data, leading to racial mistrust and socioeconomic differences in the gathering of and access to genetic information
• Reproductive issues
• Potential moral (and possibly legal) obligations once data has been obtained
• Philosophical discussions – the human right to "play God" with genetic material, and human responsibility and the meaning of free will in relation to genetically influenced behaviors
• Genetically Modified Organisms (GMOs)

1998 – Five new major aims

Figure 1.7 (Part 1) Whose genome was sequenced?

The content of the Human Genome

• Completion of the first draft of the HGP was announced at a press conference in May 2000, but publication of the result was delayed until Feb. 2001.
• The draft needed refinement of the seq. assembly, including gap closure, gene annotation, and prediction.
• It is estimated that the total number of genes is somewhere around 25,000 (~ two times greater than the gene contents of the fruit fly and C. elegans, and five times greater than yeast; see Table 1.3 for more details).

Table 1.3 Comparison of Gene Content in some Representative Genomes

• No dramatic differences in gene content between humans and other mammals.
• Sep. 1994 – the first high-resolution genetic map of the complete genome: 23 linkage groups (one per chromosome) with 1200 markers at an average of 1 cM intervals
• Around 1995 – physical map: 52000 sequence tag sites (STS) at ~60 kbp intervals
• 1998 – 3000 SNPs
• Middle of 2004 – 1.8 million mapped SNPs, see The SNP Consortium (TSC) http://snp.cshl.org, providing polymorphic markers at 2 kb intervals and placing 85% of all exons within 5 kb of a SNP.
• 2000 – the first draft of the smallest human chromosome, chromosome 21, was published

The content of the Human Genome

Two questions for the HGP

(1) Whose genome was sequenced?

• The sequence is derived from a collection of several libraries obtained from a set of anonymous donors.
• Both the IHGSC and the private firm Celera Genomics assembled their seq. from multiple libraries of ethnically diverse individuals.
• In each case one particular individual's DNA contributed most of the raw seq.: 3/4 and 2/3, respectively.

[Figure: size of shaded sector ~ amount of seq. contributed by a single individual]

The content of the Human Genome

• The Celera sample included at least one individual from each of four ethnic groups, as well as both males and females.
• Craig Venter admitted that his own DNA contributed substantially to the Celera sequence.
• His own poodle contributed to the first-draft canine genome seq.

The Human Genome Project

(2) When can we regard it as finished?

• The complete seq. of 99% of human euchromatin has been published, to an estimated error rate of ~1 event in 100,000 bases.
• Human polymorphism is an order of magnitude greater than this: at least 10 SNPs for each seq. error
• Extensive tracts of heterochromatin (where there are few or no genes, such as centromeres and telomeres), mostly associated with centromeres, that may account for as much as 20% of the total genome, will probably never be sequenced.
• Since the completion of the first draft, the HGP has focused on characterizing human diversity.
• International HapMap project – map all of the major haplotypes in the human genome and characterize their distribution among populations, as a step toward identification of human disease susceptibility factors; see http://www.hapmap.org

Internet Resources – NCBI and Ensembl

• NCBI http://www.ncbi.nlm.nih.gov
• Ensembl http://www.ensembl.org – a collaboration between EMBL-EBI and the Sanger Center in the UK.
• Both sites provide high-resolution physical maps of any segment of the genome.
• Several genome views
• UCSC Genome Browser http://genome.cse.ucsc.edu
• Commercial web sites – Incyte Genomics, Celera, Rosetta Inpharmatics, Informax, and LION Biosciences http://consert-lpg.obs.ujf-grenoble.fr/html/en/rosetta_section2_wrapper.shtml

Figure 1.8 The National Center for Biotechnology Information (NCBI) Web site.

Internet Resources – NCBI and Ensembl

Ex. 1.2 Use the NCBI and Ensembl genome browsers to examine a human disease gene. Use OMIM to identify a gene that is implicated in the etiology of the disease.

Ans. Go to http://www.ncbi.nlm.nih.gov and find a disease of interest: asthma, for example, leads to Interleukin 13 (IL13). This page gives a lot of textual information plus links to other sites, including the Human Gene Mutation Database (HGMD) and Entrez Gene.

(a) What are the various identifiers of the gene?

The OMIM identifier is *147683.

(b) Where is the gene located on the chromosome (cytologically and physically)?

The cytological location is 5q31 (chromosome 5, long arm). Click on Gene map locus → 5q31 → click location 5q31 → click NCBI MapViewer → position 132.02 Mb. The Gene ID for IL13 is 3596. Gene aliases: ALRH; P600; IL-13; MGC116786; MGC116788; MGC116789

(c) What is the RefSeq for the gene?

The RefSeq is NM_002188, an mRNA seq.

Internet Resources – NCBI and Ensembl

(d) How many exons are there in the major transcript, and how long is it?

From Entrez Gene → Display 'Gene table': 4 exons, 1282 bp long, encoding a 146 amino acid protein. Alternatively, use NCBI MapViewer → Consensus CDS (CCDS). From the RefSeq ID NM_002188, link to GenBank: signal peptide (interleukin 13 precursor), 34 aa (seq. 15 – 116); mat_peptide (interleukin 13 precursor), 98 aa.

(e) What is known about the function of the gene?

See the NCBI description: this gene encodes an immunoregulatory cytokine produced primarily by activated Th2 cells. This cytokine is involved in several stages of B-cell maturation and differentiation.

(f) Do the two annotations agree? Which browser do you prefer, and why?

Ensembl http://www.ensembl.org → type IL-13 and select the gene with Ensembl gene ID ENSG00000169194 → GeneView shows Exons: 4; Transcript length: 1,282 bp; Protein length: 146 residues.

Internet Resources - OMIM

Online Mendelian Inheritance in Man (OMIM)

• A database that provides text summarizing recent genetic research in response to a query about a particular disease, as well as links to MedLine, GenBank and other information.
• Intended for physicians and human geneticists; covers disease types such as muscle, metabolism, cardiovascular, and physiological disorders.
• OMIM lists in excess of 15,000 known disease-causing Mendelian disorders.
• GEO BLAST tool – searches for all genes in the gene expression database that have similar sequences, and then compares levels of expression of the genes across species and experimental conditions.

Figure 1.9 The Mendelian Inheritance in Man (OMIM) Web site

Internet Resources - OMIM

OMIM

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM

Use OMIM help

Internet Resources - OMIM

OMIM has a defined numbering system: certain positions within the number indicate information about the genetic disorder itself.

The first digit gives the mode of inheritance of the disorder:

1 = autosomal dominant
2 = autosomal recessive
3 = X-linked locus or phenotype
4 = Y-linked locus or phenotype
5 = mitochondrial
6 = autosomal locus or phenotype

Internet Resources - OMIM

• The distinction between 1 or 2 and 6 is that entries cataloged before May 1994 were assigned either a 1 or a 2, whereas entries after that date were assigned a 6 regardless of whether the mode of inheritance was dominant or recessive.
• * = the phenotype caused by the gene at this locus is not influenced by genes at other loci; however, the disorder itself may be caused by mutations at multiple loci
• # = the phenotype is caused by two or more genetic mutations
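The first-digit rule lends itself to a small lookup, sketched below. The function name is hypothetical and the mapping simply encodes the list given in these slides; it is not an official OMIM tool, and the leading * or # symbol is stripped without being interpreted.

```python
# First digit of a six-digit MIM number -> mode of inheritance,
# following the list in these lecture notes.
INHERITANCE_BY_FIRST_DIGIT = {
    "1": "autosomal dominant",
    "2": "autosomal recessive",
    "3": "X-linked locus or phenotype",
    "4": "Y-linked locus or phenotype",
    "5": "mitochondrial",
    "6": "autosomal locus or phenotype",
}


def describe_mim_number(mim):
    """Return the inheritance category encoded by a MIM number."""
    digits = str(mim).lstrip("*#")  # drop a leading prefix symbol, if any
    if len(digits) != 6 or not digits.isdigit():
        raise ValueError(f"not a six-digit MIM number: {mim!r}")
    category = INHERITANCE_BY_FIRST_DIGIT.get(digits[0])
    if category is None:
        raise ValueError(f"first digit {digits[0]} is not in the table")
    return category


# The two MIM numbers that appear in these notes.
print("*147683 (IL13):", describe_mim_number("*147683"))
print("604896 (MKKS):", describe_mim_number(604896))
```

Note that a leading 6 only says the entry is autosomal and was cataloged after May 1994; the dominant or recessive status then has to be read from the entry text.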

Internet Resources - OMIM

Example: 604896 (MKKS)

Display the allelic variants: after each allelic variant, a description is given of the clinical or biochemical outcome of that particular mutation.

[Figure: allelic variants for MKKS]

Internet Resources - OMIM

OMIM indicates that the gene SRY encodes a transcription factor that is a member of the high-mobility group-box family of DNA binding proteins. Mutations in this gene give rise to XY females with gonadal dysgenesis, while translocation of the part of the Y chromosome containing this gene to the X chromosome gives rise to XX males.

Q1a. An allelic variant of SRY causing sex reversal with partial ovarian function has been cataloged in OMIM. What was the mutation at the amino acid level, and what is observed in XY mice carrying this mutation?

Ans. Use "SRY AND human" for the OMIM search, then view the list of allelic variants. Variant 0020 is the correct entry. The mutation is Gln2Ter; XY mice are fertile females, although fertility is reduced and the ovaries fail early.

Internet Resources - OMIM

Q1b. Follow the Gene Map link in the left sidebar to access the MIM gene map. One other gene is found at the same cytogenetic map location. What is the name of this gene, and what methods were used to map the gene to this location?

Ans. Click Gene Map in the left sidebar. The correct gene is ZFY. Under the Methods column, REn and A are listed. Clicking on the Methods hyperlink at the top of the column shows the key to the abbreviations. REn stands for neighbor analysis in restriction fragments; A stands for in situ hybridization.

Animal Genome Projects

The International Sequencing Consortium (ISC) http://www.intlgenome.org

- A database of animal and plant genome sequencing projects - Some of these organisms are shown in Figure 1.10

Figure 1.10 (Part 1) A gallery of animal genome sequencing projects

Animal Genome Projects

- At the National Human Genome Research Institute (NHGRI), the decision to commit the tens of millions of dollars required for any new genome is made by a council of senior genome scientists, based on a 10-page "white paper"
- The council weighs the expected impact of the sequence on enabling biomedical research and the annotation of sequence function
- A draft genome can be produced for most animals within 3-6 months

Figure 1.10 (Part 2) A gallery of animal genome sequencing projects

Figure 1.10 (Part 3) A gallery of animal genome sequencing projects

GenBank Files – Box 1.2

There are many ways to present the structure and annotation of a gene or seq.

• Because of alternative splicing and alternative transcription start sites (TSS), small errors that occur during cDNA cloning, and the polymorphism that fills all genomes, the same gene may be represented by multiple different seq. or annotations in the genome database
• RefSeq – hand curation by experts
• Example – human HoxA1, 11421562. Go to http://www.ncbi.nlm.nih.gov/
  1. LOCUS: XM_004915, GI:14751246
  2. Followed by the reference, ...
  3. Features section (CDS, misc_feature, etc.), with links to GeneID, MIM, CDD
  4. Next comes the seq. in FASTA format; 'Display' also offers XML or ASN.1 file formats

GenBank Files – Box 1.2

• Use Entrez Gene – HOXA1
• Two isoforms
• GenBank format
• Graph display – HOXA1

GenBank Files – Box 1.2

Ensembl http://www.ensembl.org/index.html

Gene – HOXA1

GenBank Files – Box 1.2

UCSC Genome Browser http://genome.cse.ucsc.edu

Gene – HOXA1

Rodent Genome Projects

Mouse Genome Informatics (MGI) http://www.informatics.jax.org/

Three major advantages of rodent research are:
1. The existence of a large number of mutant strains that, combined with whole-genome mutagenesis, lead to genetic analysis of every identified locus in the genome
2. The existence of a panel of approximately 100 commonly used laboratory mouse strains with well-characterized genealogy – a resource for the study of genetic variation
3. The existence of conserved seq. blocks, which is generally an indicator of functional constraint

2002 – draft of the mouse genome
2004 – draft of the rat genome

Figure 1.11 The Mouse Genome Informatics (MGI) Web site

Rodent Genome Projects

Functional genomic analysis of rat has been stimulated by three major advances achieved in the 1990s:
1. The technology for targeted (site-directed) mutagenesis by homologous recombination of the wild-type locus with a disrupted copy
2. Saturation random (unbiased) mutagenesis programs, which gather information about the entire "sequence space", i.e., the relationship between aa sequence, 3D protein structure and function
3. The emergence of 'phenomic' analysis, in which mutagenized lines are subjected to biochemical, physiological, immunological, morphological, and behavioral tests in parallel, allowing large-scale identification of genes required for non-lethal phenotypes

Rodent Genome Projects

Conservation of gene order and DNA seq. between the human and mouse genomes http://www.ncbi.nlm.nih.gov/Homology/
(A) Blocks of synteny between mouse (chr. 11) and parts of five different human chromosomes
(B) Enlarged view of a small region – human 5q31. In this approximately 1 Mb region there is almost perfect correspondence in the order, orientation, and spacing of 23 putative genes, including four interleukins.
(C) Enlargement of the alignment of 50 kb that includes the genes KIF-3A, IL-4 and IL-13. Blue dots show the distribution of conserved seq. (with 50%-100% identity). Two of the conserved blocks (red bars) fall between genes, whereas most of the others (blue bars) are in the introns and exons of the genes.
Use PipMaker http://nog.cse.psu.edu/pipmaker

Figure 1.12 Mouse-human synteny and sequence conservation

Exercise 1.3 Compare the structure of a gene in a mouse and a human

Rodent Genome Projects

Use NCBI http://www.ncbi.nlm.nih.gov

→ choose Genome biology → mouse chr. 11 → use Maps and options → add human gene map

Rodent Genome Projects Mouse Genome Informatics (MGI) http://www.informatics.jax.org

- Integrate physical and genetic maps
- Search for orthologous genes
- Online comparison of the mouse and human genome

Rodent Genome Projects

Ex. 1.3

Use either NCBI or Ensembl browser, explore the structure of the gene used in Fig. 1.2 in a mouse and a human (and other vertebrates)

Ans.

Ensembl http://www.ensembl.org

– type in human IL13 (ENSG00000169194) → ‘Orthologue Prediction’ → view all genes in ‘MultiContigView’

IL13 is on mouse chr. 11, human chr. 5, and rat chr. 10

Box 1.2 (Part 2) GenBank Files

Other Vertebrate Biomedical Models

2004 – chicken (G. gallus) and dog (C. familiaris) genomes are fully sequenced
Motivation – biomedical
Chickens – model for oncogenesis and virology
Dog – model for complex diseases such as asthma, parasite infection, cancer, arthritis, diabetes, and behavioral disorders

Applications

• Artificial selection on breed diversity
• Research into avian evolution

Vertebrate development

Zebrafish

• transparent embryogenesis, ease of culture, existence of a dense genetic map
• Thousands of genes have been found to be required for proper development of organs
• http://zfin.org
• a variety of ecologically and commercially important fish species, such as sticklebacks, cichlids, salmonids

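Other Vertebrate Biomedical Models

Dog genome map sequencing completed (Washington Post, 2005/12/8)

http://www.udn.com/2005/12/9/NEWS/WORLD/WOR4/3052845.shtml

Dogs can serve as a major tool for investigating human genetic diseases, because some breeds are far more likely than others to develop particular diseases: Samoyeds are prone to diabetes, Rottweilers to bone cancer, spaniels are at high risk of epilepsy, and Dobermans develop narcolepsy at a far higher rate than other dogs. These diseases are also common in humans.

Cloned sheep “Dolly” http://scc.bookzone.com.tw/sccc/sccc.asp?ser=302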

Other Vertebrate Biomedical Models

Sequencing nonhuman primates, such as rhesus macaque and chimpanzee – intended to understand the origins of diversity in the immune system as well as in the mechanisms of pathogen resistance

Comparison of human and chimp seq.
• Many genes seem to have been positively selected
• Humans are differentiated from chimps by small deletions up to 10 kb in length, which occur on average every 500 kb along chromosome 21

Animal Breeding Projects

OMIA (Australia) – genome maps for over a dozen species of agricultural importance http://www.angis.org.au/Databases/BIRX/omia
• Access data on inheritance patterns for species other than human and mouse
• Benefits of breeding programs lie in improvements in yield, infectious disease resistance, adaptation to climatic conditions, improved food quality, maximizing the benefits of transgenic technology
• These goals will be met both through enhanced genetic map development and association studies using SNP technology

ArkDBs (UK, Roslin Institute in Edinburgh) http://www.thearkdb.org
• genome resources for ~10 species

Invertebrate Model Organisms

Generic Model Organism Database (GMOD) http://www.gmod.org

- A coordinated effort of the mammalian, invertebrate, and plant genome communities to standardize web tool construction and implementation and to provide open source software for database management

Figure 1.13 The GMOD project

Invertebrate Model Organisms

A 40 kb region of cytological band 43E of fruit fly, centered on the saxophone gene.

Figure 1.14 Drosophila gene annotation

Invertebrate Model Organisms

FlyBase http://www.flybase.org/
- Search for the gene symbol: sax → click the ‘gene region map’ http://www.flybase.org/cgi-bin/gbrowse_fb/dmel?ref=2R;id=FBgn0003317
- each gene either has a number beginning with CG or is identified by its standard name (e.g. sax)
- shows genes and mRNAs, and transposable element insertions (Burdock; one is shown in pink)

Invertebrate Model Organisms

• The first multicellular eukaryote to be sequenced completely was C. elegans, in 1998 http://www.wormbase.org
• Fruit fly – sequence completed in 2000
• Decades of genetic analysis have led to the molecular characterization of up to 20% of the complement of genes in these two organisms
• Over 90% of the true genes seem to have been identified
• Assigned a tentative function based on seq. similarity
• 1/3 ~ 1/4 of the predicted genes remain ‘orphans’ → with no known seq. similarity to genes in any other organism → without functional data

Invertebrate Model Organisms

• Ongoing EST sequencing, gene structure and mutational analysis
• Unexpected – there may be 50% more genes in the C. elegans genome (19,000) than there are in the fly genome (13,500), despite the fact that the fly is much more complex at several levels, including (1) the number of cells, (2) number of cell types, and (3) organization of the nervous system
• Nematode – a surprising surplus of steroid-hormone receptors
• Fruit fly – olfactory receptor family
• There is no simple relationship between gene number and tissue complexity
• The high degree of conservation of all the major regulatory and biochemical pathways, most of which are identifiable not only in both nematodes and flies but also in the unicellular eukaryote yeast and in vertebrate genomes

Invertebrate Model Organisms

Functional genomics → a major impact of the invertebrate genome projects is the prospect of obtaining mutations in every single gene of the genomes
In fly – by a combination of saturation mutagenesis + a library of overlapping deficiencies (deletions) that remove every segment of each chromosome
In nematode – saturation mutagenesis + RNAi, double-stranded RNA fed to the worms (a

Invertebrate Model Organisms

>60% of a sample of 289 human disease genes have orthologous genes in the fly, <60% in nematode, ~20% in yeast
Fig. 1.15 shows the fraction of human disease genes in each of six categories that have orthologs in the fly, nematode and yeast genomes, as detected by seq. similarity at three levels of significance
Conservation of genetic interactions across the animal kingdom → uncover genes that interact with known disease-promoting loci
Pharmaceutical companies – interested in invertebrate genomics for its potential to identify drugs that affect neural function
Example: fluoxetine resistance in nematodes, alcohol tolerance in flies
Molecular interactions between gene products can be conserved → allows the functional comparison of genes across species

Figure 1.15 Human disease genes in model organisms

Honey bee genome sequencing http://www.udn.com/2006/10/31/NEWS/WORLD/WOR4/3581547.shtml

Sea urchin genome sequencing http://tw.news.yahoo.com/article/url/d/a/061110/2/6cqy.html

Box 1.3 Managing and Distributing Genome Data

Internet technology is essential for genomic scientists
NCBI, EBI, LIMS (laboratory information management systems)
DB – RDB (relational DB) and OODB (object-oriented DB)
RDB – very effective for sorting, searching, and distributing data that fits into table form
OODB – good at handling complex data structures and useful for performing analyses on sequence ‘objects’ (data + functions for operating on the data) → a very efficient programming approach
DB query language = SQL = structured query language http://www.geocities.com/SiliconValley/Vista/2207/sg17.html

Scripting language (no need to compile) = Perl = good for extracting and processing text files http://bio.perl.org
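To make the relational idea in this box concrete, here is a small Python sketch using the standard-library `sqlite3` module. The table layout and the function names `build_db` and `chromosomes_for` are assumptions made for illustration only; the gene locations are the ones quoted elsewhere in these slides (IL13 on human chr. 5, mouse chr. 11, rat chr. 10; sax on fly 2R).

```python
import sqlite3


def build_db():
    """Create an in-memory relational table of gene locations."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene (symbol TEXT, organism TEXT, chromosome TEXT)"
    )
    rows = [
        ("IL13", "human", "5"),
        ("IL13", "mouse", "11"),
        ("IL13", "rat", "10"),
        ("sax", "fruit fly", "2R"),
    ]
    conn.executemany("INSERT INTO gene VALUES (?, ?, ?)", rows)
    return conn


def chromosomes_for(conn, symbol):
    """Run an SQL query: where is this gene located in each organism?"""
    query = (
        "SELECT organism, chromosome FROM gene "
        "WHERE symbol = ? ORDER BY organism"
    )
    return conn.execute(query, (symbol,)).fetchall()


conn = build_db()
print(chromosomes_for(conn, "IL13"))
# [('human', '5'), ('mouse', '11'), ('rat', '10')]
```

Data that fits naturally into rows and columns, as here, is where a relational DB and SQL work well; richer sequence ‘objects’ are the case the box assigns to an OODB.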

Box 1.3 Managing and Distributing Genome Data

Plant Genome Projects

Arabidopsis thaliana – the first plant genome to be sequenced, between 1999 and 2000
• ~115 Mb, ~25,000 genes, ~2 times (no. fly genes)
• Evolved via two rounds of whole genome duplication → shuffling of chromosome regions and considerable gene loss
• >1500 tandem arrays (generally 2 or 3 copies) of repeated genes have been identified, ~11,000 gene families
• Some geneticists regard this number as representative of the minimal complexity required to support multicellularity
• It is believed that all plant and animal genomes represent modifications of a ‘toolkit’ of gene families that evolved >10^9 years ago

Plant Genome Projects

>30 segmental duplications
(A) 7 intra-chromosomal duplications are shown as duplicated blocks of color within three of the five chromosomes; five duplications occur in the first chromosome, and the fourth and fifth chromosomes each display one duplication
(B) Another two dozen inter-chromosomal segmental duplications. A twist in the band → an inversion accompanied the duplication event

Figure 1.16 Chromosome duplications in the Arabidopsis thaliana genome

Plant Genome Projects

Plant genomes – plant-specific genes
Enzymes required for cell wall biosynthesis
Transport proteins that move organic nutrients, inorganic ions, toxic compounds, metabolites, and even proteins and nucleic acids between cells
Enzymes required for photosynthesis, such as Rubisco and electron transport proteins
Products involved in plant turgor (normal cell swelling), phototropic and gravitropic responses
Enzymes and cytochromes involved in the production of secondary metabolites found in flowering plants
A large number of pathogen resistance (R) genes. R genes are dispersed throughout the genome rather than localized in a single complex, as mammalian immune system genes are

Plant Genome Projects

• Plants share with animals many of the gene families – intercellular communication, transcriptional regulation, signal transduction
• A. thaliana lacks homologs of the Ras G-protein family and tyrosine kinase receptors, Rel, forkhead, and nuclear steroid receptor transcription factors
• TAIR – The Arabidopsis Information Resource http://www.arabidopsis.org

UK CropNet

http://ukcrop.net/

Grasses and Legumes

>50 different plant species are under way
The most important – major feed crops – the grasses maize, rice, wheat, sorghum, barley; the forage legumes soybean, alfalfa; forage rye grasses, fescues
Several genomes are very large → whole genome sequencing is impractical
Both rice (Oryza sativa) and maize (Zea mays) have relatively small genomes
Two major rice cultivars, japonica rice and indica rice (also: waxy, i.e. glutinous, rice)
MaizeGDB http://www.maizegdb.org

Rice-Arabidopsis synteny

• Comparison of genome sequences of rice and Arabidopsis → extensive complex patterns of synteny
• 20 of 54 genes in a 340 kb long region of the rice genome (top) retain the same order in five different 80- to 200-kb regions of the Arabidopsis genome (below). Conserved genes (red and green boxes) are found on both rice and Arabidopsis strands, but are interspersed by a variable number of different genes (yellow boxes) in Arabidopsis. Shaded boxes above the rice chromosome indicate that the conserved gene is in the opposite relative orientation on the Arabidopsis chromosomes.

Figure 1.17 Rice-Arabidopsis synteny

Grasses and Legumes

Economically important traits include resistance to a broad range of pathogens; flowering time, seed set, grain morphology, and related yield traits; tolerance to drought, salt, heavy metals and other extreme environmental circumstances; and measures of feed quality such as protein and sugar content.

Improved through genetic engineering + specialized plant breeding techniques
Genome projects → reveal much information regarding the evolution of domesticated species

Grasses and Legumes

Teosinte versus Maize
• Modern maize is a derivative of the wild progenitor teosinte, which had multiple tillers.
• Throughout the coding region of tb1, the level of polymorphism is substantially the same in a sample of maize and teosinte. However, in the 5’ UTR, there is a dramatic reduction in the level of polymorphism in maize relative to that seen in teosinte.

Figure 1.18 Teosinte branched 1 and the evolution of maize

Other Flowering Plants

• >90 angiosperm genome projects are listed on the US Department of Agriculture web site http://www.nal.usda.gov/pgdic/Map_proj
• African, Australian, European, US projects
• Genetic maps and search for a common set of plant genes
• For some species, large EST seq. projects are also in place → enable comparative genomic analysis
• Arabidopsis + grasses + several model organisms → shed light on plant evolution

Other Flowering Plants

Forest trees – potential for economic impact
High-density genetic maps of spruce, loblolly and several other pines, a few species of Eucalyptus
Trait – wood quality, growth and flowering parameters
Dendrome web site http://dendrome.ucdavis.edu
Comparative analyses and transcription profiling of genes involved in wood properties, including lignins and enzymes that regulate cell wall biosynthesis

Crop plants – potato, tomato, tobacco, beans, cotton
Analyzing the genome diversity → affects productivity, yield and quality improvements
No plant equivalent of the HGP’s ELSI initiative has been established.

Figure 1.19 Forest genomics

Microbial Genome Projects

The minimal genome

• 1995 – the 1st complete genome, H. influenzae → M. genitalium → 3 other bacteria
• 1997 – E. coli
• Seq. information – genome structures (GC content, transposable elements, recombination), genome content (total number of genes, conserved gene families)

• Gene annotation for prokaryotes is more straightforward – ORFs tend to be uninterrupted and genes tend to be closely spaced; however, the assignment of genes to operons is not trivial
• ~3/4 of the genes in a microbial genome can be assigned a function based on their similarity to genes in other organisms or by identifying protein domains
• TIGR http://www.tigr.org
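Because prokaryotic ORFs tend to be uninterrupted, a very simple scan already finds candidate genes. The Python sketch below is only an illustration under stated assumptions: it looks at the forward strand only, treats ATG as the sole start codon, and uses a hypothetical function name `find_orfs` and a made-up test sequence. Real gene finders also scan the reverse strand and use statistical models.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for ATG..stop ORFs on the forward strand.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))
# [(2, 14, 2)]
```

In the toy sequence the only ORF is ATG AAA TTT TAG, which starts at position 2 and is read in frame 2.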

Microbial Gene contents

M. genitalium – 0.6 Mb, 471 genes
H. influenzae – 1.8 Mb, 1750 genes
E. coli K12 – 4.6 Mb, 4288 genes
→ average gene length ~1.1 kb
Gene duplication and divergence in large genomes, gene loss in small genomes
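The ~1.1 kb figure can be checked with a short calculation from the genome sizes and gene counts listed above. The sketch below divides genome length by gene number, so it gives genome length per gene (which includes intergenic sequence) rather than a measured gene length; the function name `kb_per_gene` is hypothetical.

```python
# Genome size (Mb) and gene count, as quoted on the slide.
GENOMES = {
    "M. genitalium": (0.6, 471),
    "H. influenzae": (1.8, 1750),
    "E. coli K12": (4.6, 4288),
}


def kb_per_gene(size_mb, n_genes):
    """Genome length per gene in kb (includes intergenic sequence)."""
    return size_mb * 1000 / n_genes


for name, (size_mb, n_genes) in GENOMES.items():
    print(f"{name}: {kb_per_gene(size_mb, n_genes):.2f} kb per gene")
```

All three genomes come out at roughly 1.0 to 1.3 kb per gene, consistent with closely spaced genes of about 1.1 kb.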

Exercise 1.4 Compare two microbial genomes using the CMR

The minimal genome

– the minimum complement of genes that are necessary and sufficient to maintain a living organism To define genetically ‘What is life’?

Two general strategies
• Bioinformatics strategy – identify which genes are present in each and every sequenced genome
• Some functions can be performed by non-orthologous genes
• Conserved orthologs + a small number of alternatives → ~256 genes

The minimal genome

Experimental strategy – systematically knock out the function of individual genes: mutations that cannot be recovered define genes that are likely to be components of the minimal genome
• M. genitalium – mutations recovered in 120 of the 470 genes
• B. subtilis (~4100 genes) – 271 genes are indispensable under favorable growth conditions (metabolism, cell division and shape, synthesis of cellular envelope)
• Synthetic lethal – the nonviability in combination of two or more individually viable mutations
• Infer that life can be supported by a genome of between 250 and 350 genes
• Build a viable organism from scratch by stitching together artificially synthesized genes – build a poliovirus

The minimal genome

Deeper color → presence of a gene
Pale color → the gene is absent in that species
Genes a, d, f are present in all species, so are inferred to be necessary for life.

Figure 1.20A Describing the minimal genome
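The bioinformatics strategy amounts to intersecting the gene sets of all sequenced genomes. A minimal Python sketch follows; the species names and presence/absence pattern are invented for illustration (chosen so that genes a, d and f are shared, as in the figure description), and `core_genes` is a hypothetical function name.

```python
# Hypothetical presence/absence data: which gene families occur in which species.
PRESENCE = {
    "species_1": {"a", "b", "d", "f", "g"},
    "species_2": {"a", "c", "d", "e", "f"},
    "species_3": {"a", "b", "d", "e", "f"},
}


def core_genes(presence):
    """Genes found in every genome: candidates for the minimal genome."""
    return set.intersection(*presence.values())


print(sorted(core_genes(PRESENCE)))
# ['a', 'd', 'f']
```

A plain intersection misses functions carried out by non-orthologous genes in different species, which is why the estimate on the earlier slide adds a small number of alternatives to the conserved orthologs.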

The minimal genome

Mutagenesis experiments
- Establish which genes are essential by systematically knocking out each functional gene and seeing whether the organism can survive without it.

- The overlap of these two approaches may define the minimal genome .

Figure 1.20B Describing the minimal genome

Figure 1.21 TIGR representation of a typical microbial genome

Sequenced Microbial Genomes

TIGR – Comprehensive Microbial Resource (CMR) http://www.tigr.org/tigr-scripts/CMR2/CMRHomePage.spl
New site http://pathema.tigr.org/tigr-scripts/CMR/CmrHomePage.cgi
39 genomes were generated by TIGR, and the rest by Brazil, Japan … → Omniome DB

Streptococcus pneumoniae TIGR4
The outer and inner circles represent genes encoded on the two strands of the chromosome
Genes from HMM – blue, BLAST – yellow, Omniome – pink
Click ‘align genome’ – MUMMER
Click ‘Analyses’ – for more tools, such as COG/TIGRFAM/PFAM

Box 1.2 (Part 1) GenBank Files

Environmental Sequencing

Sequencing DNA extracted from an environment such as ocean, soil, or intestinal flora
The main reason is that the vast majority of bacteria cannot be cultured in vitro → our knowledge of microflora is both limited by and biased by sampling
Pilot projects – identifying novel genes has the potential to change oceanographers’ understanding of the mechanisms of photosynthesis and global carbon and nitrogen cycling
Proteorhodopsin genes – suggesting that light harvesting need not be coupled to chlorophyll in cyanobacteria
C. Venter – identified >1M new genes, almost 150 new types of bacteria
Fecal material – human gut contains >500 different species of bacteria, <30% can be cultured outside the body

Yeast
Completed in 1997
MIPS http://mips.gsf.de/genre/proj/yeast/index.jsp

SGD http://www.yeastgenome.org

Parasite Genomics
World Health Organization (WHO)
• 10 tropical diseases that affect billions of people worldwide
• Eradicating the pathogenic agents
• Crop damage caused by parasitic plant nematodes costs billions of dollars

Parasite Genomics
Aims
1. Identify species-specific genes
2. Understand the developmental genetics
3. Polymorphism surveys that address the population biology of the parasites
4. Mapping the genomics of the mosquito

100 genomes, 10 days and 10 million dollars Awards 2006 News, http://www.biotechnews.com.au/index.php/id;1321634104

World first: human sperm produced in the laboratory, 2009/07/09 http://udn.com/NEWS/WORLD/WOR4/5008159.shtml

The End