Transcript Document

The GeneCards™ Project at the Weizmann Institute of Science
• For each gene - a card with displayed data
and links to entries in major databases
• Genes with HUGO nomenclature symbols
and others
• Automatic data mining and integration
• Advanced human-computer interaction
http://bioinformatics.weizmann.ac.il/cards/
[Figure: concepts linked to a gene: chromosome, DNA sequence, disease, protein, research article, RNA, gene alias, similar mouse gene, mutation, medical applications, marker, genetic map, chromosomal location]
Databases Containing
Human Genome Information
[Figure: a cloud of database names surrounding GeneCards and UDB]
EMBL, Swiss-Prot, GeneMap'98, Genethon, GDB, GAC, GENATLAS, Whitehead/MIT, CFTR, Sanger Centre, NCBI, CEPH, Stanford, OMIM, GenBank, WashU, UniGene, dbSNP, PKR, TIGR, PRODOM, BRCA1, HOVERGEN, CHLC, Pfam, GESTEC, PDB, IMGT, BLOCKS, HGMD, PRINTS, Utah, TP53, Marshfield, LDB, TGD, COPE, PIR, DDBJ
GeneCards: From Chaos to Order
A card for each gene
• Aliases
• DNA, RNA
• Protein
• Chromosomal location
• Disorders
• Medical applications
• Related mouse gene
• Research articles
• Links to more data
Data is retrieved and integrated automatically
Data Related to Genes
Nucleotide SEQUENCE - genomic/cDNA, coding/regulatory
VARIATION (polymorphism, mutation)
Chromosomal LOCATION
EXPRESSION (tissues, developmental, disease)
PROTEIN - sequence, domains, 3D
- subcellular location
- 2D electrophoresis
Biological PATHWAYS
DISEASE
PHARMA (diagnostics, vaccines, drugs)
ORTHOLOGS (model organisms, knockout)
Commercial DNA ARRAYS
PATENTS
GeneCard:
Integrated Data and Starting Point
[Figure: entries in the data sources of GeneCards are mined and integrated into a GeneCard. The GeneCard is a starting point for more data: it links back to the data sources of GeneCards and on to other data sources.]
A typical GeneCard: RUNX1
HUGO nomenclature gene symbol
Accession ID to other databases
LocusLink or HUGO location
If chromosome 21
Information on proteins
For chromosome 21 only
Sequence accessions
Single nucleotide polymorphisms
Homologues
Disorders and mutations
Medical news from Doctor’s Guide
Published literature
Snapshot of additional GeneCard fields
Additional information
Start new search
Improved Single Nucleotide Polymorphisms
Summaries
Current GeneCards Data Sources and Links
HUGO GDB OMIM SWISS-PROT
LocusLink UDB UniGene MGD DOTS UCSC
GenBank PubMed CroW 21 Doctor’s Guide
HUGE euGenes Genatlas ATLAS HGMD TGDB
BCGD MTDB RZPD MIPS PDB BLOCKS
HORDE dbSNP ENSEMBL SBCELEGANS
GeneLynx
IMGT
SOURCE
Gene sources
13,046
HUGO
360
LocusLink
MGD
8,951
CroW 21
63
How to search and find?
Simple search box
search keywords
results
gene 1: name
- ... keyword ...
- ... ... keyword .
gene 2: name
- keyword ...
no results
spell corrections
query modification
outside resources
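The search flow outlined above (keyword in, matching genes with the lines that contain the keyword out, and spelling suggestions when nothing matches) can be sketched roughly as follows. This is only an illustration, not the GeneCards search code: the tiny `CARDS` store, the function names, and the use of Python's `difflib` for spell correction are all assumptions made for the example.

```python
import difflib

# Hypothetical miniature card store: gene symbol -> text lines shown on the card.
CARDS = {
    "RUNX1": ["runt-related transcription factor 1", "acute myeloid leukemia"],
    "TP53": ["tumor protein p53", "Li-Fraumeni syndrome"],
    "BRCA1": ["breast cancer 1, early onset"],
}


def search_cards(keyword, cards=CARDS):
    """Return {symbol: [matching lines]} using a case-insensitive substring match."""
    needle = keyword.lower()
    hits = {}
    for symbol, lines in cards.items():
        matched = [line for line in lines if needle in line.lower()]
        if matched or needle in symbol.lower():
            hits[symbol] = matched
    return hits


def suggest_spellings(keyword, cards=CARDS, limit=3):
    """Offer close spellings drawn from the words that appear on the cards."""
    vocabulary = set()
    for symbol, lines in cards.items():
        vocabulary.add(symbol.lower())
        for line in lines:
            vocabulary.update(line.lower().replace(",", " ").split())
    return difflib.get_close_matches(
        keyword.lower(), sorted(vocabulary), n=limit, cutoff=0.75
    )


def search(keyword):
    """Results if any; otherwise an empty result plus spelling suggestions."""
    hits = search_cards(keyword)
    if hits:
        return {"results": hits}
    return {"results": {}, "suggestions": suggest_spellings(keyword)}
```

Calling `search("leukemia")` returns the RUNX1 card with its matching line, while a misspelled query such as `search("leukmia")` returns no results and a list of suggested spellings. Query modification and pointers to outside resources are left out of the sketch.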
Some GeneCards Statistics
27,612 GeneCards (November, 2001)
13,548 HUGO approved genes
2,646,185 accesses to GeneCards (at WIS since January 1, 1998)
25 mirror sites around the world
The Affymetrix System
GeneChip Procedure
Sample preparation
Hybridization: fluidic station
Signal detection: scanner
Data analysis: software
ChipCards - A Functional Integration Tool
for DNA Array Data
Tsviya Olender, Shirley Horn-Saban, Marilyn
Safran, Vered Chalifa-Caspi, Michal Ronen and
Doron Lancet
The Crown Human Genome Center
The Weizmann Institute of Science, Rehovot 76100
About ChipCards
ChipCards correlates DNA array data with comprehensive
information from gene-specific databases. It is currently
implemented for the Affymetrix GeneChip.
ChipCards’s output is an HTML table with essential additional
information for each gene including: gene symbol, functional
definition, accession number, protein information, chromosomal
location and EST data.
Human data is integrated with GeneCards, UDB and UniGene.
Mouse data is integrated with information about the human
orthologue via GeneCards, HomoloGene and MGD.
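As a rough illustration of the idea described above, the sketch below joins array rows to per-gene annotation and renders an HTML table. It is not the ChipCards implementation: the probe set names, accession numbers, annotation fields, and function names are invented placeholders, and the real tool draws its annotation from GeneCards, UDB and UniGene rather than from an in-memory dictionary.

```python
from html import escape

# Hypothetical expression rows: (probe set, accession number, signal).
EXPRESSION = [
    ("probe_A", "ACC0001", 812.5),
    ("probe_B", "ACC0002", 97.0),
]

# Hypothetical annotation lookup keyed by accession number.
ANNOTATION = {
    "ACC0001": {"symbol": "GENE1", "definition": "example kinase", "location": "21q22"},
}

# Annotation columns appended to every expression row.
COLUMNS = ("symbol", "definition", "location")


def annotate(rows, annotation):
    """Yield each expression row extended with its annotation (blank if unknown)."""
    for probe, accession, signal in rows:
        info = annotation.get(accession, {})
        yield [probe, accession, f"{signal:.1f}"] + [info.get(c, "") for c in COLUMNS]


def to_html_table(rows, annotation):
    """Render the annotated rows as a plain HTML table."""
    header = ["probe set", "accession", "signal", *COLUMNS]
    lines = ["<table>"]
    lines.append("<tr>" + "".join(f"<th>{escape(h)}</th>" for h in header) + "</tr>")
    for cells in annotate(rows, annotation):
        lines.append("<tr>" + "".join(f"<td>{escape(c)}</td>" for c in cells) + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)
```

Probe sets without a known annotation keep their row and simply get empty cells, so the output table has the same number of rows as the array data.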
Example of GeneChip output before ChipCards
processing
An Extract of Human Expression Data After ChipCards Processing
NCBI link
GeneCards link
UDB link
A snapshot of ChipCards’s result, with human Affymetrix expression data as input.
Each probe set has a link to NCBI, GeneCards and UDB. Information about the cDNA sources of the gene
is extracted from UniGene and is given as a separate column in the table. UDB coordinates are likewise given in a separate column.
Murine Expression Data After ChipCards Processing
Human orthologue data
NCBI link
GeneCards link
Murine UniGene link
Human UniGene link
NCBI link
A snapshot of ChipCards output for mouse Affymetrix expression data.
Each probe set is linked to NCBI and UniGene. Information about the human orthologue is integrated into the table
and includes links to NCBI, GeneCards and UniGene.
Current Research - Adding Cards for Genes
that Don’t Yet Have a Name
[Figure, steps 1-5: UniGene clusters and assembly-based resources; gene sequence tag; unique persistent gene identifier; GeneCard for novel gene]
Version 3.0 Project Goals
Improving flexibility, allowing automated
parameterized generation from partial sets of sources
and/or genes, and appending to an existing database
Providing an Application Programming Interface for
users of the generation software to incorporate their
own data
Standardizing the format of the database to use XML
Project Goals
(cont’d)
Providing a foundation for supplying a stable identifier
for each GeneCard, even when no known gene symbol
exists
Improving the maintainability, testability, and
quality of the software
Providing a seamless migration path from Version 2.xx
while maintaining the current look and feel and
functionality
Pros and Cons of Using OOP
Cons:
• Perl was not originally designed as an OOP language
• Type safety, proper encapsulation and aggregation aren’t enforced
• Can be between 20% and 50% slower
Pros:
• Allows for more robust implementations
• Greater modularity
• More comprehensible interface to modules
• Better abstraction of software components
• Less namespace pollution
• Greater code reusability
• Software scalability
• Cleaner and more compact code
The 3.0 Hybrid Solution
Combines an object-oriented skeleton with some non-object-oriented internals
• The large data structure of gene-based data is implemented as a hash of hashes, avoiding numerous costly instantiations
• All other major components, including the extractors and administration classes, are implemented as objects
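The hybrid layout can be illustrated with a small sketch. The project itself is written in Perl; Python is used here only for illustration, with a dict of dicts standing in for the Perl hash of hashes. The field names, the `Extractor` class, and the sample values are assumptions made for the example, not the actual GeneCards data model.

```python
from collections import defaultdict

# Gene data as a dict of dicts (the Python analogue of a Perl hash of hashes):
# gene symbol -> field name -> list of values. No per-gene object is created.
gene_data = defaultdict(lambda: defaultdict(list))


def add_value(symbol, field, value):
    """Append one value to a field of a gene, creating the entry on first use."""
    gene_data[symbol][field].append(value)


class Extractor:
    """Component-level object: one instance per data source, not one per gene."""

    def __init__(self, source_name):
        self.source_name = source_name
        self.records_added = 0

    def store(self, symbol, field, value):
        """Record a value in the shared structure, tagged with its source."""
        add_value(symbol, field, f"{value} [{self.source_name}]")
        self.records_added += 1


# Two source objects write into the same plain data structure.
swissprot = Extractor("SWISS-PROT")
swissprot.store("RUNX1", "protein", "example protein entry")
omim = Extractor("OMIM")
omim.store("RUNX1", "disorder", "example disorder entry")
```

With tens of thousands of genes, the per-gene data stays in cheap built-in containers, while only a handful of objects (one per source or administrative component) are ever instantiated.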
GeneCards Architecture
Generation Software
UniGene Extractor
SwissProt Extractor
API
GeneCards Database
Customized
Extractor
Support Functions
Display
Software
Generation Software Classes
An underlying layer of support tools that manage extracting data from
locally mirrored files and the internet, proxy connections,
verification, security, file management, caching, conflict detection,
error handling, statistics, and XML output formatting
A set of extractor classes, one for each source of information using
source-specific algorithms and heuristics (adapted from previous
versions of GeneCards). Methods include new, prepare and search
A template for building extractor classes. All such classes can create
new or append to old entries, as well as generate data for all entries
(genes) at once, or one at a time
A main class that handles building sets of cards according to
parameterized partial ordering rules
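A minimal sketch of how a template class and a main build class could fit together is given below, in Python rather than the project's Perl. The method names `prepare` and `search` come from the slide (construction stands in for `new`); everything else is an assumption, including the two toy extractors, the `update_cards` helper, and the reading of "partial ordering rules" as "this extractor must run after that one", handled here with the standard library's `graphlib`.

```python
from abc import ABC, abstractmethod
from graphlib import TopologicalSorter


class ExtractorTemplate(ABC):
    """Template for per-source extractors; construction plays the role of `new`."""

    name = "UNNAMED"

    def __init__(self):
        self.ready = False

    def prepare(self):
        """Load or index the mirrored source data before any search."""
        self.ready = True

    @abstractmethod
    def search(self, symbol):
        """Return a dict of fields found for one gene symbol."""

    def update_cards(self, cards, symbols):
        """Create new entries or append to existing ones, one gene at a time."""
        if not self.ready:
            self.prepare()
        for symbol in symbols:
            cards.setdefault(symbol, {})[self.name] = self.search(symbol)


class NomenclatureExtractor(ExtractorTemplate):
    name = "NOMENCLATURE"  # hypothetical source

    def search(self, symbol):
        return {"symbol": symbol}


class ProteinExtractor(ExtractorTemplate):
    name = "PROTEIN"  # hypothetical source

    def search(self, symbol):
        return {"protein": f"protein record for {symbol}"}


def build_cards(extractors, must_follow, symbols, cards=None):
    """Run a (possibly partial) set of extractors in an order that respects the rules.

    must_follow maps an extractor name to the names that have to run before it;
    rules naming extractors that are not in this run are ignored.
    """
    cards = {} if cards is None else cards
    by_name = {e.name: e for e in extractors}
    graph = {
        name: set(must_follow.get(name, ())) & set(by_name) for name in by_name
    }
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        by_name[name].update_cards(cards, symbols)
    return cards, order
```

Passing an existing `cards` dictionary appends to it instead of starting from scratch, and passing a subset of extractors or symbols gives a partial build.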
The XML-Based Database
XML is a meta-language that supports customized tags for describing and
providing semantic meaning to structured data
Typed elements are arranged within other elements
to form a nested hierarchy
The data is grouped by source in the XML files, but can be retrieved by function:
<GCresource>SWISSPROT
  <protein>
    <disorder>Colorectal Cancer</disorder>
  </protein>
</GCresource>
<GCresource>OMIM
  <disorder>Germline Cancer</disorder>
</GCresource>
<GCresource>GENECLINICS
  <disorder>Li-Fraumeni Syndrome</disorder>
</GCresource>
Each extractor module is responsible for its own Document Type Definition
(DTD) specification to ensure that the XML is well formed and valid
Files are stored in a hierarchical directory structure, one file per gene
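To make "grouped by source, retrieved by function" concrete, the sketch below parses a fragment shaped like the one on this slide and collects every `<disorder>` regardless of which source contributed it. It uses Python's standard `xml.etree.ElementTree` rather than the Perl parser modules the project plans to use, and the wrapping `<GeneCard>` root element and the function name are assumptions added so that the fragment forms a single document.

```python
import xml.etree.ElementTree as ET

# A fragment like the one on the slide, wrapped in a hypothetical root element.
CARD_XML = """
<GeneCard>
  <GCresource>SWISSPROT
    <protein>
      <disorder>Colorectal Cancer</disorder>
    </protein>
  </GCresource>
  <GCresource>OMIM
    <disorder>Germline Cancer</disorder>
  </GCresource>
  <GCresource>GENECLINICS
    <disorder>Li-Fraumeni Syndrome</disorder>
  </GCresource>
</GeneCard>
"""


def values_by_function(xml_text, tag):
    """Collect (source, value) pairs for one functional tag across all sources."""
    root = ET.fromstring(xml_text)
    found = []
    for resource in root.iter("GCresource"):
        # The source name is the text that precedes the first child element.
        source = (resource.text or "").strip()
        for element in resource.iter(tag):
            found.append((source, (element.text or "").strip()))
    return found


for source, disorder in values_by_function(CARD_XML, "disorder"):
    print(f"{disorder} (from {source})")
```

The same function with a different tag name, for example `"protein"`, would pull out another functional view without changing how the file is organized on disk.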
The Display Software
Currently in the design phase
Want to maintain the current look and feel while
providing the flexibility of easy customization
Will use XML Perl parser modules in CGI scripts
Search will be expanded beyond current text-based
capabilities to include context-specific searches
3.0 Project Status and Open Issues
Procedural programs/ad-hoc flat file format
Object-oriented methodology/standardized XML
Easy to add new extractors
Flexible and extensible
Performance, searching strategies
Unified Database (UDB)
Data mining and integration
Original
public
databases
Data mining
Semantic
Integration
Thesaurus
Source-specific information
Megabase
Integration
UDB
Integrated chromosomal maps
Sequence-Based Repositioning (SBR)
Placing finished genomic sequences on the UDB map.
Map fine-tuning in sequenced regions.
SBR (Sequence-Based Repositioning)
Elimination of overlaps between contigs
Object repositioning
UDB original map
SBR map
Search Results - a Map Slice
to GeneCard
to Unigene
to MarkerCard
A MarkerCard
GeneCards Success Stories
• GeneCards as a bookmark for linkage analysis
• Mutations that were polymorphisms and not
disease-causing
• Adult-onset diabetes without obesity in India
• Work on Chromosome 21 at the Weizmann Institute
• PVT – a heart disease found in Israeli Bedouins
• Parkinson’s disease paper
Frequently Asked Questions
• What’s special about GeneCards?
• Can I interface my own data?
• Can I access my own in-house database mirrors
instead of public internet sites?
GeneCards/UDB Team
current:
Avital Adato
Vered Chalifa-Caspi
Michal Lapidot
Zvia Olender
Naomi Rosen
Marilyn Safran, head
Orit Shmueli
Irina Solomon
Doron Lancet, PI
alumni:
Michael Rebhan
Shai Shen-Orr
Inga Peter
Jaime Prilusky
Michal Ronen
Hershel Safer
Julie Stampnitzky
Liora Yaar