Transcript Document

Informatics Support of Data Management for
Multi-Centric Clinical Studies: Integrating
Clinical and Genetics/Genomic Data
Prakash M. Nadkarni, Kexin Sun and Cynthia Brandt
Scope and Definitions

Clinical research is increasingly concerned with the
influence of inheritable traits on disease.
Genetics: The study of the organization, regulation,
function and transmission of heritable information in
organisms. (UniGuide Academic Guide to the
Internet, www.aldea.com)
Genomics: Investigations into the structure and
function of very large numbers of genes, undertaken
simultaneously. (UC Davis Genome
Center, http://genomics.ucdavis.edu )
The choice of phrase depends on your age group
and your agenda (Dr. Rochelle Long, NIGMS)
Types of Genomic Research

Structural genomics

An initial phase of genome analysis, whose end-point is to
yield high resolution genetic and physical maps of an
organism. The ultimate physical map of an organism is its
complete DNA sequence.
The sequence itself can never be known with complete
precision, because parts of the sequence vary
across individuals.
Functional genomics


Development and application of large-scale (genome-wide or
system-wide) and/or high-throughput approaches to assess
gene function, using information and reagents provided by
structural genomics.
Combined with statistical/computational analysis of the results.
Variation in Sequence: Mutations
and Polymorphisms

Some parts of the sequence of an organism appear
highly stable, while others show variation.
A variation that is prevalent enough to occur in at
least 1% of a population (not necessarily the
human race as a whole) is called a polymorphism.
The polymorphisms that have been most widely
studied are single nucleotide polymorphisms
(SNPs). However, repeats are also important.
(Huntington’s chorea is characterized by repeats in
part of a gene – the more repeats, the greater
the severity and the earlier the onset.)
Genotype and Phenotype (I)

An important aspect of functional genomics –
variation in sequence (genotype) leads to variation
in function as expressed at molecular, cellular,
organ and system levels (phenotype).
Phenotype is “the outward, physical manifestation
of internally coded, inheritable information”
(Blamire, http://www.brooklyn.cuny.edu/).
“Correlation of genotype to phenotype” is one of the
goals of several recent cooperative efforts – e.g.,
the Pharmacogenetics Research Network.
Correlating Genotype with
Phenotype


Clinical studies that determine this correlation
proceed in two directions:
Genotype to phenotype: screening of large numbers
of individuals at particular genetic loci (that encode
for known proteins) identifies particular variations,
which may result in a variant end-product. These
subjects are persuaded to enroll in studies where,
along with controls, they are challenged with
particular drugs. Variations in response are then
measured.
Correlating Genotype with
Phenotype (II)


Phenotype to genotype: patients with particular
clinical traits (e.g., poor response to standard
therapies for specific conditions) are identified.
These subjects are then screened at multiple
candidate loci (for genes known or suspected to be
involved in that disease) to identify variations. For
conditions such as heart disease or hypertension,
hundreds of loci may have to be screened.
One practical problem here is statistical
significance – if a sufficiently large number of loci are
screened, apparently “significant” associations will arise
even in random data, unless the analysis corrects for
multiple comparisons.
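The multiple-comparisons problem above can be illustrated with a small simulation. This is a minimal sketch: the locus count, significance level, and random seed are arbitrary choices for the demonstration, not values from any real study.

```python
import random

# Screening many loci on purely random (null) data still yields
# nominally "significant" hits at p < 0.05; a Bonferroni-corrected
# threshold (alpha divided by the number of tests) suppresses them.
N_LOCI = 10_000
ALPHA = 0.05

random.seed(0)
# Under the null hypothesis, p-values are uniformly distributed.
p_values = [random.random() for _ in range(N_LOCI)]

naive_hits = sum(p < ALPHA for p in p_values)
bonferroni_hits = sum(p < ALPHA / N_LOCI for p in p_values)

print("Naive 'significant' loci:    ", naive_hits)   # roughly 5% of 10,000
print("After Bonferroni correction: ", bonferroni_hits)
```

With purely random data, roughly 500 of the 10,000 loci appear "significant" at the naive threshold, while the corrected threshold leaves essentially none.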
Representing Genotype
Computationally

Genotype is relatively straightforward to represent
computationally, as variations from a consensus
sequence: substitutions, insertions, deletions, or
variations in repeat counts.
Length of sequence to consider: rather than
focusing on individual variants in isolation, it is
preferable to consider a set of several such variants
that are inherited as a unit (the haplotype).
NCBI’s dbSNP database supports genotype and
haplotype representation.
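The variation-from-consensus representation above can be sketched with two small record types. The class and field names are illustrative and do not reproduce dbSNP's actual schema; the CYP2D6 example values are for demonstration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """One variation from the consensus sequence."""
    locus: str       # gene or reference-sequence identifier
    position: int    # offset within the consensus sequence
    kind: str        # "substitution", "insertion", "deletion", "repeat"
    reference: str   # consensus allele (or repeat unit)
    observed: str    # observed allele (or repeat count, as text)


@dataclass(frozen=True)
class Haplotype:
    """A set of variants inherited together as a unit."""
    name: str
    variants: tuple  # tuple of Variant, ordered by position


snp = Variant("CYP2D6", 1846, "substitution", "G", "A")
hap = Haplotype("CYP2D6*4", (snp,))
print(hap.name, "carries", len(hap.variants), "variant(s)")
```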
Human Phenotyping Studies and
Repositories

“Phenotype” means different things to clinical
researchers and to classical human or animal
geneticists.
To the latter, it has traditionally been a “syndrome”,
consisting of one or more detectable or visible traits.
 These days, it is more likely to be defined in terms of
variation from the norm (for better or for worse). It is
characterized by clinical studies.
 The single most useful catalog of human variation is
Online Mendelian Inheritance in Man (OMIM),
maintained by Victor McKusick’s team at Johns
Hopkins and made accessible via NCBI’s Web site.

Human Phenotyping Studies and
Repositories (II)

The ultimate goal is to create national repositories
of phenotypic data that are computable, in that they
contain structured data.
The purpose of storing “raw” data is
to facilitate later mining of that data.
 OMIM is a text database, and despite its great value
has limited computability.

Challenges in Representing
Phenotype

Phenotype is not a single entity: it is a set of
parameters.
The universe of parameters constituting
“phenotype” is highly variable: function can be
characterized at a molecular, organelle, cellular,
organ-system or whole-organism level.
The parameters are specific to the gene or genes
that are being studied.
Across all genes or genetic disorders of interest, the
total number of parameters would run into the
hundreds of thousands.
Creating Databases to record Phenotype
Characterization Studies

The problem of representing phenotypic data is
very similar to the problem of representing clinical
patient data in clinical patient record systems.


A vast number of clinical parameters can potentially
apply to a human subject, but for a given clinical
study, only a modest number of parameters actually
apply.
The same modeling approach – Entity-Attribute-Value (EAV) – can be used.
Historically, EAV was first used in the TMR system (Stead and
Hammond), and later in the HELP system at LDS Hospital.
 Put on a firm relational-database footing by the
Columbia-Presbyterian CDR efforts
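The EAV approach can be sketched with two tables: a catalog of attributes and a long, narrow table of observations. This is a minimal illustration using SQLite; the table and column names are invented, not taken from TMR, HELP, or the Columbia-Presbyterian CDR.

```python
import sqlite3

# Two-table EAV sketch: parameter definitions live in 'attribute';
# every recorded value is one row in 'observation'.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE attribute (id INTEGER PRIMARY KEY, name TEXT, datatype TEXT);
    -- One row per (subject, event, parameter) value.
    CREATE TABLE observation (
        subject_id INTEGER,
        event_id   INTEGER,
        attr_id    INTEGER REFERENCES attribute(id),
        value      TEXT
    );
""")
conn.execute("INSERT INTO attribute VALUES (1, 'systolic_bp', 'integer')")
conn.execute("INSERT INTO attribute VALUES (2, 'serum_glucose', 'real')")
conn.execute("INSERT INTO observation VALUES (101, 1, 1, '128')")
conn.execute("INSERT INTO observation VALUES (101, 1, 2, '5.4')")

# Adding a new study parameter needs only a new 'attribute' row,
# not a schema change - the point of the EAV design.
rows = conn.execute("""
    SELECT a.name, o.value FROM observation o
    JOIN attribute a ON a.id = o.attr_id
    WHERE o.subject_id = 101 ORDER BY a.id
""").fetchall()
print(rows)
```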

Clinical Study Data Management
Systems vs. CDRs (I)

In Clinical Study databases, clinical data gathering is not
open-ended. It is typically segregated into events (“visits” at
an outpatient level), whose schedule is determined by the
study protocol.
The parameters that are recorded at each event are
determined in advance. For reasons of patient safety and
economy, not all parameters are sampled at every event.
In clinical studies, individual response to therapy is less
important than how subjects react as a group. Relative time
points based on Events are therefore more important than
the absolute date-time when an event occurred. (This
impacts temporal querying of the data.)
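The event-relative view of time described above amounts to converting absolute dates into study-day offsets from each subject's baseline event. A minimal sketch, with invented subject IDs and dates:

```python
from datetime import date

# Each subject's baseline event anchors that subject's timeline;
# visits are compared by study day, not by calendar date.
baseline = {"S01": date(2004, 1, 5), "S02": date(2004, 3, 12)}
visits = [
    ("S01", date(2004, 1, 19), "week2"),
    ("S02", date(2004, 3, 26), "week2"),
]

for subject, when, label in visits:
    study_day = (when - baseline[subject]).days
    print(subject, label, "= study day", study_day)
# Both subjects align at study day 14 despite different calendar dates,
# so their week-2 responses can be compared as a group.
```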

(Figure: parameter report)
Clinical Study Data Management
Systems vs. CDRs (II)

Certain areas, e.g., psychiatry, are characterized by
extensive data gathering based on questionnaires. Most
questionnaire items do not map to standard controlled
vocabularies – each questionnaire in effect constitutes its
own vocabulary.
During data entry for questionnaires, extensive
dependencies between individual parameters require
support for “skip logic” – certain parameters are disabled
for entry based on the values entered for previous parameters.
In general, automatic generation of Web-enabled forms for
robust data entry is a high priority, especially when
numerous concurrent studies are to be supported with
modest human resources.
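The skip logic mentioned above can be driven from a simple rule table, which also lends itself to automatic form generation. A minimal sketch; the item names and rules are invented for illustration:

```python
# Rule table: (controlling item, triggering value) -> items to disable
# when that value is entered.
SKIP_RULES = {
    ("smoker", "no"): ["cigarettes_per_day", "years_smoked"],
}


def enabled_items(all_items, answers):
    """Return the questionnaire items still open for entry,
    given the answers recorded so far."""
    disabled = set()
    for (item, value), targets in SKIP_RULES.items():
        if answers.get(item) == value:
            disabled.update(targets)
    return [i for i in all_items if i not in disabled]


items = ["smoker", "cigarettes_per_day", "years_smoked", "alcohol_use"]
print(enabled_items(items, {"smoker": "no"}))
```

Because the rules are data rather than code, a form generator can apply the same mechanism to every questionnaire without per-study programming.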
Clinical Study Data Management
Systems vs. CDRs (III)

Often, a set of parameters, rather than a single attribute,
conveys meaningful information. E.g., to describe
an adverse drug reaction:
The nature of the reaction as described by
patient/clinician
 The best match to a controlled-vocabulary term
 Severity – usually “anchored” to a reference
scale (e.g., NCI Common Toxicity Criteria) where
possible
 Whether it responded to treatment
 Whether therapy needed to be stopped.
 “Severity” is meaningless in isolation (severity of
what?).
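The grouping above can be modeled as a single record type, so that severity is never stored apart from the reaction it qualifies. A sketch with illustrative field names and an assumed 1-5 grade scale:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdverseReaction:
    """One adverse-drug-reaction observation: the parameters travel
    together as a unit, never as isolated attribute values."""
    narrative: str                        # as described by patient/clinician
    coded_term: str                       # best controlled-vocabulary match
    severity_grade: int                   # anchored to a reference scale
    responded_to_treatment: Optional[bool]
    therapy_stopped: bool


adr = AdverseReaction(
    narrative="itchy rash on both forearms",
    coded_term="Rash maculo-papular",
    severity_grade=2,
    responded_to_treatment=True,
    therapy_stopped=False,
)
print(adr.coded_term, "grade", adr.severity_grade)
```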

Handling Genetic Data (I)

While the human genome has been sequenced, we don’t
know what the vast majority of the DNA does. New genes
are still being discovered.
Traditional approaches, such as collection of pedigree data
and linkage analysis, still apply.
For voluminous data such as Mass-spectrometry data
(Proteomics) or even raw Gene Expression/ Microarray
data, consider storing the data in its original format for the
most part, with the database only tracking the location of the
data files.


Decomposing such data into individual database rows greatly
increases its bulk, with questionable benefit for what is
essentially a stream of X-Y pairs.
Many analytical programs have been created to operate on
data in their original formats.
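The track-the-location approach above reduces the database's role to a metadata index over the untouched raw files. A minimal sketch; the table layout and file path are invented for illustration:

```python
import sqlite3

# The database stores only metadata plus the location of each raw
# file; the bulky X-Y streams stay in their original format for the
# analysis programs that expect them.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE raw_data_file (
        subject_id INTEGER,
        assay_type TEXT,   -- e.g., 'mass-spec', 'microarray'
        file_path  TEXT,   -- location of the untouched original file
        acquired   TEXT
    )
""")
conn.execute(
    "INSERT INTO raw_data_file VALUES "
    "(101, 'mass-spec', '/data/ms/run_0042.raw', '2004-06-01')"
)

path, = conn.execute(
    "SELECT file_path FROM raw_data_file WHERE subject_id = 101"
).fetchone()
print("Analysis tools read directly from:", path)
```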
Handling Genetic Data (II)

Descriptions of gene-array experiments lend themselves
to attribute-value modeling approaches because, despite
efforts to create controlled-vocabulary descriptors, many of
the descriptors are specific to the research problem being
studied.
Certain summary results may be stored in the database.
Consider the use of display technologies like Scalable
Vector Graphics (SVG) to generate interactive graphics.

Pedigree diagrams
Summary of polymorphism data for individual genes.
SVG is based on XML – the Web server generates a stream of
XML that a browser plug-in interprets as a stream of drawing
instructions to render the graphic.
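Server-side SVG generation of the kind described above is just string assembly of XML. A minimal sketch for a polymorphism summary along a gene; the positions, variant kinds, and coordinates are invented for illustration:

```python
# Each variant becomes a marker positioned along a line that
# represents the gene; a browser plug-in renders the resulting XML.
variants = [(120, "SNP"), (450, "SNP"), (800, "repeat")]

parts = [
    '<svg xmlns="http://www.w3.org/2000/svg" width="1000" height="60">',
    '  <line x1="0" y1="30" x2="1000" y2="30" stroke="black"/>',
]
for pos, kind in variants:
    color = "red" if kind == "SNP" else "blue"
    parts.append(f'  <circle cx="{pos}" cy="30" r="5" fill="{color}"/>')
parts.append('</svg>')

svg = "\n".join(parts)
print(svg)
```

In a real system the same markup would carry tooltips or hyperlinks per marker, which is what makes the graphic interactive.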
Interchanging Data


Be prepared to bulk-import from a variety of formats (e.g.
spreadsheets) to bootstrap the database from legacy data.
XML is potentially attractive, but must be used judiciously.

Remember that someone must write a program to put the data
into XML format.
For phenotypic data, creating an endless number of domain-specific
tags achieves little beyond full employment for
programmers – consider simple formats that are the direct
counterpart of attribute-value structures – EDSP.
Avoid highly nested structures that increase programming
effort.
XML should follow from the data model – it should not be used
to define a data model (a lesson from the MicroArray and Gene
Expression Data (MGED) group).
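A flat, generic attribute-value XML format of the kind recommended above can be sketched as follows. The tag and attribute names here are illustrative, not a published standard:

```python
import xml.etree.ElementTree as ET

# One generic <item> element per parameter, instead of minting a
# domain-specific tag for each of hundreds of thousands of parameters.
root = ET.Element("observations", subject="101", event="week2")
for name, value in [("systolic_bp", "128"), ("serum_glucose", "5.4")]:
    ET.SubElement(root, "item", attribute=name, value=value)

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

A single generic parser handles any parameter set, and adding a new parameter requires no change to the format or the interchange code.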