Directions in Open Science Mike Travers SRI Bioinformatics Research Group For AIC Lunch and Learn, 30 Jan 2012


Directions in Open Science
Mike Travers
SRI Bioinformatics Research
Group
For AIC Lunch and Learn, 30 Jan 2012
About this talk
• Partly a trip report from Open Science
Summit 2011
• Partly an attempt to define open science
and explore its impact
• Partly an excuse to talk about some of my
own vaguely related work
• And partly some semi-crazy speculation
about future projects in this space
The Open Science Summit unites researchers, life science
industry professionals, students, patients and other
stakeholders to discuss the future of collaborative science
and innovation.
…in-depth sessions on new models for drug discovery and
clinical trials, personal genomics, the patent system, the
future of scientific publications, and more.
What is Open Science?
• Many different things, but boils down to:
• Removing barriers to scientific
communication and collaboration:
– Social
– Technical
– Legal
– Economic
– Bureaucratic
• To accelerate scientific progress
• Utilizing modern technology
Driven by technological change
• The Internet has radically reduced
communication costs
• So old institutions of scientific
communication are now obstacles
– Closed academic publishers, notably:
• Internet will transform scientific media just
like it has newspapers, TV, social life….
• The difference is: science is more
important than sharing cat pictures
For-profit academic publishing is a racket
A very lucrative one
Rumbles of complaint (boycotts) are starting to come from academics
Open Access
• Most visible and successful branch
of open science
• Articles are free to read, pay
to publish
• Funders are starting to require
some form of public access
Gold: OA journal, Green: OA self-archiving
Open Access to the Scientific Journal Literature: Situation 2009, PLoS ONE, Bo-Christer Björk et al
Research Works Act
• H.R.3699 – “A bill to ensure the continued
publication and integrity of the peer-reviewed
research works by the private sector.”
No Federal agency may adopt, implement, maintain, continue, or otherwise engage in any
policy, program, or other activity that--
(1) causes, permits, or authorizes network dissemination of any private-sector research
work without the prior consent of the publisher of such work; or
(2) requires that any actual or prospective author, or the employer of such an actual or
prospective author, assent to network dissemination of a private-sector research work.
Myth 1: American consumers have a right to free access to articles their tax
dollars fund.
Fact
American taxpayers do not fund peer reviewed research articles; they fund some
of the research that is used in those articles…
Beyond Open Access
• Not going to say a whole lot about OA, because:
• It’s easy to understand
• It’s pretty clearly going to win in the long term
• By itself, not a very radical change to how science is done:
– Knowledge is still in paper-sized chunks
– Papers are peer-reviewed prior to publication
– Once something is published, it’s static
• All these parameters are being challenged in some
way by other efforts
• George Whitesides (Harvard chemist): “The concept
of the scientific paper is eroding before our very
eyes”
Variations on publishing
• “Peer review is broken”
– Too slow
– Too biased
– Too rigid
– May be “the worst system except for all the others”
• Pre-peer-review publication
– Eg arXiv.org
• Micropublication
– Crowdsourcing, blogs, wikis….
• Open-notebook science
– No gap at all between bench and publication
• Database-linked publications
• Dynamic Review Papers
Biggest sequencing operation
in the world
Generating 6 terabytes/day of
genomic data
Open-Source Genomic Analysis of Shiga-Toxin–Producing E. coli
O104:H4 Rohde et al 2011 (NEJM)
Toxic E. coli outbreak in Germany May 2011:
We released these data into the public domain… which elicited a burst of crowdsourced, curiosity-driven analyses carried out by bioinformaticians on four
continents. Twenty-four hours after the release of the genome, it had been
assembled; … Five days after the release of the sequence data, we had designed and
released strain-specific diagnostic primer sequences, and within a week, two dozen
reports had been filed on an open-source wiki …dedicated to analysis of the strain
https://github.com/ehec-outbreak-crowdsourced
GigaScience is a new integrated database and journal
co-published in collaboration between BGI Shenzhen
and BioMed Central, to meet the needs of a new
generation of biological and biomedical research as it
enters the era of "big-data."
Dynamic Review Papers
Conventional paper
Paired with
Dynamically updated, wiki-based paper/database/model
Driving apps
Who comes to Open Science
Summits?
Activist Organizations
Participatory Medicine
& Disease Foundations
Startups
Social paper and citation
management
Scientific services
marketplace
Web-based molecule
library management
Citizen Science
Somewhat less garage
• Independent research institute, started
from data released by Merck
• Repository of experimental data (Sage
Commons)
• Network of cooperating institutions
• Starting to build a computational platform
(Synapse)
Synthetic Biology
And some individual
researchers
• Peter Murray-Rust
Chemist, Cambridge,
promoter of Chemical Markup Language and
semantic web
“Closed science makes people die!”
• Victoria Stodden
Statistician, Columbia,
reproducibility of computational science
(cf ClimateGate)
Some open science success stories
• Galaxy Zoo
• FoldIt
• Nutrient Network (NutNet)
• Praziquantel synthesis
Galaxy Zoo
• Citizen science (loosely)
• Image classification task
• Mechanical Turk-like approach (but
unpaid)
• About 200K participants
• Discovered a whole new class of galaxies
(“green pea”) and a quasar mirror
• 22 published papers in 3 years
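A crowdsourced classification task like this needs some way to combine many volunteers' answers for the same image. The sketch below uses a simple majority vote with an agreement threshold. This is an illustrative assumption, not a description of how Galaxy Zoo actually aggregates its data; the function name, image IDs, labels, and the 0.75 threshold are all hypothetical.

```python
from collections import Counter


def aggregate_votes(votes, min_agreement=0.75):
    """Combine volunteer labels for one image by majority vote.

    Returns (label, agreement). The label is None when the most common
    answer does not reach the min_agreement fraction of all votes.
    """
    if not votes:
        return None, 0.0
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    if agreement < min_agreement:
        return None, agreement
    return label, agreement


# Hypothetical volunteer classifications, keyed by image ID.
classifications = {
    "galaxy_001": ["spiral", "spiral", "spiral", "elliptical", "spiral"],
    "galaxy_002": ["spiral", "elliptical", "elliptical", "spiral"],
}

consensus = {image: aggregate_votes(votes) for image, votes in classifications.items()}
for image, (label, agreement) in consensus.items():
    print(image, label, round(agreement, 2))
```

Images without a clear consensus come back with a label of None, so they can be shown to more volunteers or passed to an expert.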
Social sharing of algorithms (“recipes”)
Descent with modification
Matthew Todd, chemist at U of Sydney
Schistosomiasis
Looking for synthesis for known drug Praziquantel (PZQ) in enantiopure form
Open-notebook science
(LabTrove)
Nutrient Network (NutNet)
What paper has the most
authors?
• NutNet paper:
40 authors, 41 institutions
• This one from SLAC and elsewhere:
407 authors, but only 35 institutions
Three variations on the scientific
process
• Automated Science
• Distributed Science
• Web-scale Intelligent Science
• Open Science as the lubrication /
accelerant that makes these feasible
Afferent: Automation for Drug
Discovery
• Combinatorial Chemistry
• Planning software to drive lab robots
Distributed Science
• Some science (eg evaluation of drug
candidates) is highly parallelizable,
• Hence distributable
• CollabRx was initially supposed to support
“virtual pharma companies” that would tie
disparate academic research efforts into
focused teams
Web-scale Intelligent Science
• Imagine all of science as a giant distributed
computational process
• Individual scientists are agents
– working on a small part of the problem
– Sharing their results
– Getting feedback and funding dependent on
success
• Centralized data integration and decision
tools used to help determine next useful
experiment
Steps towards distributed
intelligence
• Adaptive clinical trials
– Rather than a classical trial with two arms run to
completion
– Change the distribution of test cases based on
ongoing results
• Now imagine this strategy applied more globally
across all treatments for a disease
• Credit for this slightly mad vision goes mainly to
Marty Tenenbaum:
– AI Meets Web 2.0 (2006)
– Shrager, Tenenbaum, Travers, Cancer Commons:
Biomedicine in the Internet Age (2011)
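The adaptive-trial idea above, shifting the distribution of test cases toward whatever is working, can be sketched as a small simulation. The version below uses Thompson sampling over Beta posteriors, which is one common way to do response-adaptive allocation. It is an assumption chosen for illustration, not the method described in the cited papers; the response rates, patient count, and function name are made up.

```python
import random


def run_adaptive_trial(true_response_rates, n_patients, seed=0):
    """Simulate a response-adaptive trial using Thompson sampling.

    Each arm keeps a Beta(successes + 1, failures + 1) posterior over its
    response rate. For every patient we draw one sample per arm and assign
    the patient to the arm with the highest draw.
    """
    rng = random.Random(seed)
    n_arms = len(true_response_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    assigned = [0] * n_arms

    for _ in range(n_patients):
        draws = [
            rng.betavariate(successes[arm] + 1, failures[arm] + 1)
            for arm in range(n_arms)
        ]
        arm = draws.index(max(draws))
        assigned[arm] += 1
        # Simulated outcome; in a real trial this is the observed response.
        if rng.random() < true_response_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1

    return assigned, successes


# Hypothetical two-arm trial: arm 1 is truly better (70% vs 30% response).
assigned, successes = run_adaptive_trial([0.3, 0.7], n_patients=500)
print("patients per arm:", assigned)
print("responders per arm:", successes)
```

Unlike a classical two-arm trial with a fixed split, most simulated patients end up on the better arm as evidence accumulates. Applying the same loop across all treatments for a disease means passing in a longer list of arms.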
What does all that have to do with
Open Science?
• Open Science is lowering barriers to
collaboration
• So it’s a necessary but not sufficient step
towards this new kind of science
• CollabRx may just have been too early:
– the groundwork hasn’t been laid yet,
– we are still working on basics
– (eg standards for representation)
• Reducing friction (or transaction costs) can
be incredibly important
“Changing the cost of innovation
fundamentally changes the nature of
innovation”
– Joichi Ito
TCP, HTTP etc are the
containerization of
data.
So what’s the analog
for scientific
knowledge?
Standardized Legal and
Institutional Mechanisms
A mix of technical,
institutional, and legal
standardization:
-Standard licenses
(parameterizable)
-RDF representation for
licenses.
-Web Tools to generate
these
-Sites that collect and
“market” available
materials.
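As a rough sketch of what a parameterizable license with an RDF representation might look like, the code below turns a few license parameters into Turtle. It borrows terms from the Creative Commons rights vocabulary (cc:License, cc:permits, cc:requires, cc:prohibits) as an assumption; the slide does not say which vocabulary is used. The license URI and the function name are hypothetical.

```python
CC_NS = "http://creativecommons.org/ns#"


def license_to_turtle(license_uri, permits=(), requires=(), prohibits=()):
    """Render license parameters as a Turtle document.

    Each parameter is a sequence of local names in the cc: namespace,
    for example "Reproduction" or "Attribution".
    """
    lines = [f"@prefix cc: <{CC_NS}> .", "", f"<{license_uri}> a cc:License"]
    groups = (("permits", permits), ("requires", requires), ("prohibits", prohibits))
    for predicate, terms in groups:
        for term in terms:
            lines.append(f"    ; cc:{predicate} cc:{term}")
    lines.append("    .")
    return "\n".join(lines)


# Hypothetical license for a shared research material.
turtle = license_to_turtle(
    "http://example.org/licenses/materials-basic",
    permits=["Reproduction", "Distribution"],
    requires=["Attribution"],
    prohibits=["CommercialUse"],
)
print(turtle)
```

A web tool of the kind listed above would collect these parameters from a form and publish the resulting RDF alongside the human-readable license.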
BioBike, a platform for open
science
• Conceived of as a vehicle for getting
biologists to do their own knowledge-based biocomputing.
• Lisp + Frame system + Bioinformatics
Tools
– Through-the-web programmability
– Community sharing of code and data
– Visual Programming Language
• Open Source
• Jeff Elhai, Arnaud Taton, J. P. Massar, John K. Myers, Michael Travers, Johnny Casey, Mark
Slupesky, Jeff Shrager. BioBIKE: A Web-based, programmable, integrated biological knowledge
base. Nucleic Acids Research, 2009
BioBike and Open Science
• BioBike wasn’t for Open Science per se
• But it did explore some ideas in web-based biocomputation
• The next-generation BioBike platform:
– Data: Big data, Open data, semantic web
integrated
– Programming: Able to deal with large scale
and distributed workflows with human
elements
– Collaboration: Integrating different communities in a “trading zone”
KnowOS: The (Re)Birth of the Knowledge Operating System. Mike Travers, JP Massar, Jeff Shrager,
International Lisp Conference 2005
What is a platform?
• The economic meaning of “platform” is interesting
• Something that:
– Supports two-sided network effects
– Stands in the middle and extracts a toll
• Examples:
– Credit cards
(merchants ↔ consumers)
– Operating systems
(application developers ↔ users)
• Science has more complicated networks and relations
– Data providers
– Data consumers
– Service providers
– Analysts (statisticians, eg)
– Patients
• A science platform is not going to make anyone rich like Facebook,
but it would be nice to have a powerful and standard way for all
these groups to collaborate.
Open Data is outstripping analysis
capacity
• Or in other words:
– data is cheap,
– attention, knowledge, & expertise are
expensive
• A platform for collaborative computational
interpretation of biological data
• To better leverage the expensive
resources
identifies advancing new computational infrastructure as a
priority for driving innovation in science and engineering.
Scientific discovery and innovation are advancing
along fundamentally new pathways opened by the
development of increasingly sophisticated software.
the overarching goal of transforming
innovations in research and education into
sustained software resources that are an
integral part of the cyberinfrastructure
Anti-open arguments
• Peer-review is an essential filter; without it
too much nonsense gets out
• Electronic availability of articles actually leads
to narrowing of science (Evans, 2008)
• Privacy, HIPAA, etc.
• Need to retain IP for economic motivation
• The problem isn’t availability of data; it’s
making sense of what we do have
• See PRISM for more
Opener Science
• Science is already
pretty open!
• Institutions of openness
played a role in the
foundation of science,
including the first
scientific journals
Historical Origins of Open
Science
• Before the invention of science,
knowledge of the natural world was closely
guarded, passed down from master to
apprentice.
• The development of institutions of
openness was a key factor in the scientific
revolution (Paul David, Stanford
economist)
• …and the printing press was a key factor
in that.
So…
• The printing press is almost 600 years old
• The scientific journal is almost 350 years old
• There’s been some advancement in
communication technology since then…
• Science will eventually change:
– Either a modest acceleration of the scientific
process,
– Or as significant and discontinuous as the first
scientific revolution
• Which one? An open question.
Further Reading
End