Chemical Entity extraction using the chemicalize.org-technology Josef Scheiber Novartis Pharma AG – NITAS/TMS.
Chemical Entity extraction using the
chemicalize.org-technology
Josef Scheiber
Novartis Pharma AG – NITAS/TMS
Where the story of this project started ...
A day in October 2008
Some time around 7:45
in the morning ...
Novartis Campus
Dreirosenbrücke
Vision for text mining
Integration of chemical and biological knowledge
Mining for Chemical Knowledge - Rationale
- Make text corpora searchable for chemistry
- Generate chemistry databases for use in research based
on Scientific Papers or Patents
- Link Chemical Information with further annotation in an
automated way for e.g. Chemogenomics applications
- Patent analysis for MedChem projects
Connection table
Mining for Chemical Knowledge - Rationale
Information on compounds
targeting GPCRs
HELP
Information
explosion
Source: Banville, Debra L. “Mining chemical structural information from the drug
literature.” Drug Discovery Today, Number 1/2 Jan. 2006, p.35-42
Example:
Project Prospect – Royal Society of Chemistry
Enhancing Journal Articles with Chemical Features
This helps you identify other articles
that discuss the same molecule
Mining for Chemical Knowledge – Focus for today
- Make text corpora searchable for chemistry
- Generate chemistry databases for use in research based
on Scientific Papers or Patents
- Link Chemical Information with further annotation in an
automated way for e.g. Chemogenomics applications
- Patent analysis for MedChem projects
Connection table
A use case for successful patent mining
(molecules you sometimes find in your inbox ;-) )
Sildenafil (1998, Pfizer): € 11.7 billion (USD 15.1 billion)
Vardenafil (2003, Bayer): € 1.24 billion (USD 1.6 billion)
Slide inspired by an example from Steve Boyer/IBM;
sales data from the Prous Integrity database
Conventional Database Building
Facts – current standard
... (ACS) owes most of its wealth to its two 'information
services' divisions — the publications arm and the
Chemical Abstracts Service (CAS), a rich database of
chemical information and literature. Together, in 2004,
these divisions made about $340 million — 82% of the
society's revenue — and accounted for $300 million (74%)
of its expenditure. Over the past five years, the society has
seen its revenue and expenditure grow steadily ...
Source: ACS homepage
Facts
Established application
Straightforward use
De-facto Gold standard
Unique data source
Very costly
No structure export for reasonable price
Very limited in large-scale follow-up analysis
Most recent patents not available
Not data (search), but integration, analysis and
insight, leading to decisions and discovery
Now – What would be the perfect solution?
All patent offices require that all claimed
structures be provided in a machine-readable
version, available for one-click download
Text extraction
Definition:
Extract all molecules that
are mentioned in a patent
text of interest, convert
them to structures and
make them available in
machine-readable format
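The definition above can be sketched as a three-step pipeline: find names in text, convert them to structures, emit a machine-readable record. The following is a minimal illustration only, not the chemicalize.org implementation. It assumes a tiny hypothetical lookup table (`NAME_TO_SMILES`) in place of a real name-to-structure converter, and the function names are invented for this sketch.

```python
import json
import re

# Hypothetical toy lexicon (name -> SMILES). A real system would call a
# name-to-structure converter instead of using a fixed lookup table.
NAME_TO_SMILES = {
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
    "caffeine": "Cn1cnc2c1c(=O)n(C)c(=O)n2C",
    "ethanol": "CCO",
    "benzene": "c1ccccc1",
}

# One case-insensitive pattern that matches any lexicon name as a whole word.
NAME_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, NAME_TO_SMILES)) + r")\b",
    re.IGNORECASE,
)


def extract_molecules(text):
    """Return one record per distinct lexicon name, in order of first mention."""
    records = {}
    for match in NAME_PATTERN.finditer(text):
        name = match.group(1).lower()
        # Keep only the first occurrence of each name.
        records.setdefault(
            name,
            {"name": name, "smiles": NAME_TO_SMILES[name], "offset": match.start()},
        )
    return list(records.values())


def to_machine_readable(records):
    """Serialize the extracted records as JSON."""
    return json.dumps(records, indent=2)


if __name__ == "__main__":
    sample = "The compound was dissolved in ethanol; Aspirin and ethanol served as references."
    print(to_machine_readable(extract_molecules(sample)))
```

A dictionary lookup only finds names it already knows. Systematic (IUPAC) names, which dominate patent text, need a rule-based name-to-structure converter of the kind listed among the provider technologies.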
Mining for Chemical Knowledge
Technologies from providers
Text entity recognition
(a) Extractors (IUPAC names)
- TEMIS Chemical Entity Relationships Skill Cartridge
- Accelrys Pipeline Pilot extractor (Notiora)
- Fraunhofer (ProMiner Chemistry)
- Chemaxon (chemicalize.org)
- Oscar (Corbett, Murray-Rust et al.)
- SureChem
- IBM ChemFrag Annotator
(b) Converters (names to connection table)
- CambridgeSoft name=struct
- Openeye Lexichem
- Chemaxon
Image recognition
- OSRA (NIH)
- Clide Pro (Keymodule Ltd.)
- Fraunhofer chemoCR
- ChemReader
The objective
To offer NIBR scientists a tool with sophisticated
text analysis methods that leverages
the methods of TMS
Mining for Chemical Knowledge – Novartis Tools – the
chemicalize-technology is working under the hood!
Clipboard Analysis
Patent text
Identified structures
View structure onMouseOver
Export to other applications
Mining for Knowledge – Novartis Tools
Input example: J Med Chem Paper
Mining for Chemical Knowledge – Use Case
Medicinal Chemist wants to synthesize competitor
compound as tool compound for own project
This enables the identification of compounds most
representative for a competitor patent
Identification of core scaffold
Analysis of substitution patterns
Example – A text-based patent
A patent example
Automated text extraction: 452 compounds
Reference: 636 compounds
71%
Example – An image-based patent
Text extraction is not suitable for this case: it finds only a
meager 40 molecules, against 1129 in the reference. Why?
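The two patent examples can be compared with a short calculation. This sketch assumes the quoted 71% is simply the ratio of extracted compounds to reference compounds. The slides do not state whether every extracted compound was matched against the reference set, so the function name `coverage` and that interpretation are assumptions.

```python
def coverage(extracted, reference):
    """Fraction of the reference compound count reached by extraction."""
    if reference <= 0:
        raise ValueError("reference count must be positive")
    return extracted / reference


# Counts taken from the two patent examples in the slides.
examples = [
    ("text-based patent", 452, 636),
    ("image-based patent", 40, 1129),
]

for label, found, ref in examples:
    print(f"{label}: {found}/{ref} = {coverage(found, ref):.1%}")
# text-based patent: 452/636 = 71.1%
# image-based patent: 40/1129 = 3.5%
```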
An entirely image-based patent example
Language issues – e.g. Japanese patents
Encountered problems
OCR (Optical Character Recognition)!!
USPTO and WIPO patents are now available as full text in most cases
Typos!
Name2Struct problems (less of an issue here)
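One common way to soften OCR errors and typos is fuzzy matching of each token against a list of known names. The sketch below uses Python's standard `difflib` for this. It is an illustration of the general idea, not the method used in the Novartis tools. The lexicon `KNOWN_NAMES`, the function `correct_name` and the 0.8 cutoff are assumptions chosen for the example.

```python
import difflib

# Hypothetical lexicon of known chemical names.
KNOWN_NAMES = ["aspirin", "caffeine", "ethanol", "benzene"]


def correct_name(token, cutoff=0.8):
    """Return the closest known name for a possibly misspelled token, or None."""
    matches = difflib.get_close_matches(token.lower(), KNOWN_NAMES, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    for token in ["asprin", "Benzcne", "water"]:
        print(token, "->", correct_name(token))
```

A high cutoff keeps false corrections rare but misses badly garbled names. In chemistry a single changed character can denote a different molecule, so corrections of this kind are safer as suggestions for a user to confirm than as silent replacements.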
IBM initiative
Patent Mining / ChemVerse database (Steve Boyer)
The objective is to automatically extract all molecules from
all patents available and make them searchable in a
database
They leverage cloud computing and have access to all full-text patents
This is going in absolutely the right direction
They annotate the molecules with information from freely
available databases
Future ideas: Patent Analysis
Markush translation, Image+Target
Ranking capabilities of outcome for User
"Blurred" dictionaries for translating generic terms like aryl, cycloalkyl etc.
Select and annotate as entity: on-the-fly error correction
Result goes in a database
Crowdsourcing efforts to improve and store results
Suggest functionality
To enable true Patinformatics analyses ...
Definition by Tony Trippe:
Acknowledgements
NITAS/TMS
Therese Vachon
Daniel Cronenberger
Pierre Parisot
Martin Romacker
Nicolas Grandjean
Alex Fromm
Katia Vella
Olivier Kreim
Clayton Springer
Naeem Yusuff
Bharat Lagu
And many other people in different divisions of NIBR for their support