Biogen Idec Literature Informatics
for Drug Discovery
William Hayes, PhD
Phoebe Roberts, PhD
March 19, 2007
Mission
• Provide
– access to literature and text resources
– tools to access and manage literature and
text resources
– expert analyses of literature and text
resources
– the most advanced tools and analyses
available
Agenda
• Value Proposition
• Literature Informatics Overview
• Projects
• Summary
Value Proposition
• A recent trend in the industry is to cut
the library to a bare operational staff - to
manage E-journals and document
delivery
• To do so eliminates our ability to make
knowledgeable decisions for drug
development
The Scope of the Literature Problem – You cannot keep up!
The annual worldwide production of information in publications is estimated
as 8 TB in books, 25 TB in newspapers, 20 TB in magazines, and 2 TB in
journals
Every minute scientific knowledge
increases by 2,000 pages
It takes five years to read the new
scientific material produced every
24 hours
80% of information is stored as
unstructured text
The number of papers associated
with a pharma target:
in 1990 = 100
in 2001 = 8
Library -> Literature Informatics
• Deliver information
• Requires variety of skill sets (Library
science, operations, technical,
informatics, domain expertise)
What is Literature Informatics ?
• Applying data management and analytical technologies to
extract and store knowledge from
scientific/business literature
• Analytical technologies:
–Information retrieval
–Text mining
–Semantic reasoning and inference
• Analytical objectives:
– What protein interactions can be found in the corpus?
– Which genes are expressed in a particular pathway with respect to a
specific disease for a specific genetic group?
– Which compounds inhibit a protein?
– Which documents found are toxicology-related?
– Show me all co-occurring genes and diseases
Literature Informatics Benefits
• Much more efficient overview of research
areas
– Save significant time for individual
researchers/the company
• Ability to effectively extract information from
hundreds to millions of documents
• Greater than 10X improvements in speed of
analysis and recall
• More value captured from $Millions spent on
literature content and research
External vs Internal Research Dollars
• US Total: $94.3B (2003) (JAMA. 2005;294:1333-1342)
– Public 43% - NIH(28%), Other Federal (7%), State/local
gov (5%), Charity (3%)
– Private 57% - Pharma (29%), Biotech ~1500 companies
(19%), Device (9%)
• Pfizer R&D (2004)
– $8B (a single funding agency spends 3.5X Pfizer's total!)
• Biogen Idec - 3rd largest biotech
– $684M (2004) R&D (0.7% of US Total)
Number of Papers Published (from PubMed)
             2002     2003     2004     2005
Medline      550538   580725   619626   681899
Pfizer       381      444      490      460
Biogen Idec  70       53       51       52
Text Analytics Financial Analysis
• Given 1000 researchers
• 22% time searching and analyzing
literature (Outsell survey 2002)
• 220 person-years per year analyzing
literature
– $22M / year
• Significant percentage of that time is
retrievable using advanced text analytics
and expert analysts
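The arithmetic behind the $22M figure can be sketched as follows. The cost of $100,000 per researcher-year is not stated on the slide; it is simply the value implied by 220 person-years costing $22M, and is treated here as an assumption.

```python
# Back-of-envelope estimate of annual literature-analysis cost.
RESEARCHERS = 1000                # from the slide
LITERATURE_TIME_FRACTION = 0.22   # Outsell survey 2002, from the slide
# ASSUMPTION: cost per person-year is not given on the slide; $100K is
# the value implied by 220 person-years costing $22M.
COST_PER_PERSON_YEAR = 100_000


def literature_cost(researchers, fraction, cost_per_year):
    """Return (person_years, dollars) spent on literature per year."""
    person_years = researchers * fraction
    return person_years, person_years * cost_per_year


person_years, dollars = literature_cost(
    RESEARCHERS, LITERATURE_TIME_FRACTION, COST_PER_PERSON_YEAR
)
print(f"{person_years:.0f} person-years, ${dollars / 1e6:.0f}M per year")
```

Changing COST_PER_PERSON_YEAR or the time fraction shows how sensitive the estimate is to those two inputs.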
Front-loading Safety Concerns
• Lead optimization (LO) costs ~$126M (Tufts
survey)
• LO projects take between 2-4 years
• ~50% LO projects undergo attrition due to
safety concerns (Tufts survey results)
• ~50% of safety issues had literature indicators
at beginning of project (anecdotal evidence)
• $25M per 4 LO projects can be recovered IF
comprehensive literature analyses can speed
up Safety analyses by 20%
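One reading of these numbers that reproduces the $25M figure is sketched below. It is an interpretation, not the authors' stated model: it assumes that a 20% faster safety analysis recovers 20% of the cost of each LO project that fails on a safety issue with early literature indicators.

```python
# One possible reading of the slide's numbers (an interpretation only).
LO_COST_MILLIONS = 126        # Tufts survey figure from the slide
SAFETY_ATTRITION = 0.5        # share of LO projects lost to safety concerns
LITERATURE_INDICATORS = 0.5   # share of those with early literature signals
SPEEDUP = 0.2                 # 20% faster safety analysis
# ASSUMPTION: time saved translates linearly into cost recovered.


def recoverable_millions(n_projects):
    """Estimated $M recovered across n_projects LO projects."""
    flagged = n_projects * SAFETY_ATTRITION * LITERATURE_INDICATORS
    return flagged * LO_COST_MILLIONS * SPEEDUP


print(f"${recoverable_millions(4):.1f}M per 4 LO projects")
```

Under these assumptions, four LO projects yield one project where literature analysis could have helped, and 20% of $126M is roughly $25M.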
Text Analytics Impact
• Case 1: start with an unknown protein, determine interaction
network. No standard procedure without NLP tools – estimated 2-3
weeks of manual mining. With an NLP tool that extracts connectivity
information with graph visualization from full-text journal articles – 1
hour
• Case 2: determine toxicity patterns for a compound, or determine
toxicity side-effects of inhibiting a target. With manual OVID search
– library scientists have already put in 3 months, a total of a year
estimated. With NLP+ontologies (OBIIE) – 2-3 weeks.
• Case 3: An unknown protein is somehow linked to a known disease.
There is a lot of disease literature, but only 4 papers on the protein.
Establish a plausible connection of mechanism of action with this
disease. Without NLP – indefinite. With OBIIE – 2-3 weeks.
The Analyst’s Role
• Understand questions asked, problems
encountered
– Too much information
– Not enough information
– Relevant information is buried
• Match resources to needs
– Protein-centric versus pipeline?
– Better clinical or chemistry coverage?
• Know search logic and available tools
• Pre-screen end-user tools
The Analyst’s Role
• Link disparate resources for
improved coverage
• Repackage results to match
question, user preferences
• Never lose sight of user experience
– Alleviate tedium
– Minimize error
– Increase relevance
– Make them look good
• Raise awareness of previously
unanswerable questions
Drug Discovery & Due Diligence
Information Requirements
• Set up alerts/RSS feeds on company, compound,
clinical trial info, etc
• What’s in clinic for indication, trial info/protocols and
stage of trial
• Safety issues
• Potential alternative indications
• Biomarkers
• Toxicities of compounds for indication
• Potential consultants, collaboration map
• More comprehensive searches for research,
development, pharmacodynamics, clinical trials,
adverse events, etc.
Typical Text Mining Workflow
Using workflow technologies to build text mining applications
from finer-grained components/services
Text documents pass through the first three stages; feature extraction outputs numerical feature vectors for data mining.
• Retrieval/Storage: retrieve and organize relevant documents
– Indexing
– Access drivers
– Storage
• Text Processing: pre-process documents to enhance the ease of feature extraction
– Stemming, stop-word filters, pattern filters, lexicon matching, ontologies, NLP parsing, etc.
• Feature Extraction: features are summarized into vector forms which are suitable for data mining
– Statistical: word counts, pattern extraction and counts, etc.
– Domain-specific: gene name counts, etc.
– NLP-specific: phrase counts, etc.
• Data Mining: results can be document characterization or hidden relationship extraction
– Classification, clustering, association, statistical analysis, visual analysis, etc.
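The text processing and feature extraction stages of this workflow can be illustrated with a minimal sketch. The stop-word list, gene lexicon, suffix-stripping stemmer, vocabulary, and sample sentences are all hypothetical stand-ins chosen for illustration; a production pipeline would use a real stemmer and curated lexicons.

```python
import re
from collections import Counter

# ASSUMPTION: tiny illustrative stop-word list and gene lexicon.
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "was", "is", "to"}
GENE_LEXICON = {"il-1", "mmp-9", "erbb2"}


def tokenize(text):
    """Text processing: lowercase, split into tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9][a-z0-9\-]*", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def extract_features(text, vocabulary):
    """Feature extraction: stemmed word counts plus a gene-name count."""
    tokens = tokenize(text)
    counts = Counter(stem(t) for t in tokens)
    vector = [counts[term] for term in vocabulary]
    gene_mentions = sum(1 for t in tokens if t in GENE_LEXICON)
    return vector + [gene_mentions]


# Invented sample sentences, used only to exercise the functions.
docs = [
    "A neutralizing antibody against MMP-9 was administered.",
    "The antibody blocks erbB2 signaling in neurons.",
]
vocabulary = ["antibody", "neutraliz", "block"]  # stemmed terms
vectors = [extract_features(d, vocabulary) for d in docs]
print(vectors)
```

Each document becomes a fixed-length numerical vector (three statistical counts plus one domain-specific count), which is the form a classifier or clustering step in the data mining stage would consume.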
Overview
• Collect
– Quosa
– Medline
• Explore
– Biovista
• Extract
– Linguamatics I2E
• Infrastructure
– KDE
Quosa
• Federated search/alerts
• Localize full-text papers
• Find information not found in abstracts (kinetic
parameters, experimental protocols, etc)
• Manage literature
• Collaborate
• Analyze literature sets
• Develop corpora for other applications to analyze
Biovista
Interactive Co-occurrence Analysis
• Basic Research
– Target expansion and off-target effects
– Experimental design
– Going fishing
– Finding connections between known facts
– Comprehensive summary of a research area
– Collaboration
• Clinical Development
– Drug-Drug interactions
– Timeline studies
– Side effects to worry about
• Intellectual Property
– Analyze issued patents
• Competitive Intelligence
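The idea of co-occurrence analysis can be shown with a minimal sketch that counts gene-disease pairs appearing in the same abstract. This is a generic illustration, not a description of how Biovista works internally; the term lists and the sample abstracts are invented for the example.

```python
from collections import Counter
from itertools import product

# ASSUMPTION: toy term lists, purely illustrative.
GENES = {"baff", "cripto"}
DISEASES = {"lupus", "cancer"}


def cooccurrences(abstracts):
    """Count gene-disease pairs mentioned in the same abstract."""
    pairs = Counter()
    for text in abstracts:
        cleaned = text.lower().replace(",", " ").replace(".", " ")
        words = set(cleaned.split())
        genes = sorted(GENES & words)
        diseases = sorted(DISEASES & words)
        pairs.update(product(genes, diseases))
    return pairs


# Invented placeholder abstracts; they make no scientific claims.
abstracts = [
    "Placeholder abstract one mentions BAFF and lupus.",
    "Placeholder abstract two mentions Cripto and cancer.",
    "Placeholder abstract three mentions BAFF, lupus and cancer.",
]
counts = cooccurrences(abstracts)
for (gene, disease), n in counts.most_common():
    print(gene, disease, n)
```

Pairs with high counts are candidates for closer reading; an interactive tool would let the analyst drill from a pair back to the supporting documents.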
Linguamatics I2E
• Fact search engine
• Uses semantic entity types coupled with
syntactic search criteria for relationship
extraction
• Agile NLP application
Inforsense KDE
• Text Mining Infrastructure
• Text/Data workflow environment
Use case 1: Where are early licensing
opportunities in academia?
Goal: identify areas of research that could yield
potential therapeutics
Criteria:
• some efficacy is established in the form of testing in
animal models
• Pre-IND filing
Approach: Survey the literature for papers that
describe in vivo testing of reagents that affect a
particular biology (e.g., immunity, neurology or tumor
growth)
Paint a picture of the desired target
• Use internal projects to develop search criteria
• Four early-stage projects each have 5-10 papers describing
neutralizing antibodies
• The papers mention an indication only half the time
• The papers always mention tissues and cell types
• Antibodies are described in a limited number of ways
• The target of the antibody is almost always in the same
sentence as the antibody term
• The ability of an antibody to block function is described in a
limited number of ways
Use the desired features to construct a
search
Antibody and protein terms in the same sentence
Block/neutralize and variations somewhere in the abstract
Nervous tissues somewhere in the abstract
“a neutralizing monoclonal antibody against IL-1 beta was infused into the
wound immediately following the injury”
“a neutralizing monoclonal antibody directed against MMP-9 was
administered intravenously”
“anti-rat neutralizing IL-1 beta antibody (anti-IL-1 beta) or control
immunoglobulin G antibody (IgG) was microinjected”
“potent blocking of p75 binding occurs only with MAb 909”
“an antibody that blocks erbB2/neu-mediated signaling inhibited vestibular
ganglion neuron viability”
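The three search criteria can be approximated with regular expressions, as in the sketch below. This is not the actual I2E query, which uses semantic entity types; the pattern names and term lists here are hypothetical and cover only the example sentences on this slide.

```python
import re

# ASSUMPTION: illustrative term patterns, not the real search vocabulary.
ANTIBODY = re.compile(r"\b(antibod(y|ies)|mab)\b", re.IGNORECASE)
PROTEIN = re.compile(r"\b(IL-1 beta|MMP-9|p75|erbB2)\b", re.IGNORECASE)
BLOCKING = re.compile(r"\b(block\w*|neutraliz\w*)\b", re.IGNORECASE)
NERVOUS = re.compile(
    r"\b(neuron\w*|ganglion|nerve\w*|brain|spinal)\b", re.IGNORECASE
)


def split_sentences(abstract):
    """Naive sentence splitter on terminal punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", abstract) if s]


def matches(abstract):
    """Apply the three criteria from the slide to one abstract."""
    # 1. Antibody and protein terms in the same sentence.
    same_sentence = any(
        ANTIBODY.search(s) and PROTEIN.search(s)
        for s in split_sentences(abstract)
    )
    # 2. Block/neutralize variations anywhere in the abstract.
    # 3. A nervous tissue term anywhere in the abstract.
    return bool(
        same_sentence and BLOCKING.search(abstract) and NERVOUS.search(abstract)
    )


hit = matches(
    "an antibody that blocks erbB2/neu-mediated signaling inhibited "
    "vestibular ganglion neuron viability"
)
miss = matches(
    "a neutralizing monoclonal antibody directed against MMP-9 was "
    "administered intravenously"
)
print(hit, miss)
```

The second sentence fails only the tissue criterion, which shows how each criterion narrows the result set independently.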
Search Results
Use Case 2: the Gene List
– Generated by biomarker studies and toxicity studies; central to translational medicine
– Often hundreds of genes
– Official names are obscure
– Finding all the names, and the most common name, is hard
– On average, one a week

Official Name   BIIB name
Itga4           Tysabri
Itgb1           Tysabri
Tnfsf13b        BAFF
Tdgf1           Cripto
Cd80            Galiximab
Fcer2a          Lumiliximab
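Annotating a gene list with familiar names amounts to a lookup against a synonym table. The sketch below uses only the official-name/BIIB-name pairs listed on this slide; the function name and the case-insensitive matching are assumptions made for illustration.

```python
# Official gene name -> BIIB name, taken from the pairs on this slide.
OFFICIAL_TO_BIIB = {
    "Itga4": "Tysabri",
    "Itgb1": "Tysabri",
    "Tnfsf13b": "BAFF",
    "Tdgf1": "Cripto",
    "Cd80": "Galiximab",
    "Fcer2a": "Lumiliximab",
}

# ASSUMPTION: match case-insensitively, since gene lists vary in casing.
_LOOKUP = {name.lower(): biib for name, biib in OFFICIAL_TO_BIIB.items()}


def annotate(gene_list):
    """Pair each gene with its BIIB name, or None if it is unknown."""
    return [(gene, _LOOKUP.get(gene.strip().lower())) for gene in gene_list]


annotated = annotate(["itga4", "Cd80", "Foo"])
for gene, biib_name in annotated:
    print(gene, "->", biib_name or "no BIIB name found")
```

With hundreds of genes per list, the unmatched entries are the ones that need an analyst's attention.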
A Literature Analytics Workflow
Gene Expression Analysis
Find Relevant Genes from
Online Databases
Find Associations between Frequent Terms
Visualizing search results and
information within yields new insights
• Paging through abstracts one by one
doesn’t show the big picture:
– Who’s collaborating with whom?
– Who’s patenting their work?
– When did the field develop and mature?
– Who are the opinion leaders?
[Figure: 1934 Author/Affiliations, 8893 relations. Legend: Blue = Aurora Kinases; Green = Cancer lit; Red = Patents lit]
Where do we need to be?
• Spend less time acquiring, more time assimilating
• Provide domain experts with powerful literature
analytics
• Mix/match best of breed applications for combining
text/data mining
• Need knowledge discovery/exploitation
environment that supports rapid construction of
integrated text/data results for researchers
Acknowledgements
• Connie Matsui
• June Ivey
• Pam Gollis
• Harry Bochner
• Adrean Andreas
• Cindy Shamel
• Steve French
• Research Informatics