The Gene Wiki, from a BioRDF-naïve perspective W3C / HCLSIG BioRDF Subgroup November 17, 2008

2
Patterns of gene annotation
How do we efficiently annotate the function of the ~25,000
genes in the mammalian genome?
Goal: “Genome-wide functional genomics”
[Figure: two log-log histograms of log(# genes) versus log(# references), one labeled Entrez Gene, with power-law fits of the form P(k) ~ k^(-a). Reported fits: a = -0.6, R squared = 0.894; a = -1.32, R squared = 0.963.]
44% of genes in Entrez Gene have zero linked references. Over 75%
have five or fewer linked references.
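A minimal sketch of how a fit like the one on this slide could be produced, assuming the input is simply a list with one reference count per gene. The function names and the least-squares approach are assumptions for illustration; the slide does not say how its fits were computed. The function returns the slope of the line in log-log space, together with R squared.

```python
import math
from collections import Counter


def fit_power_law(reference_counts):
    """Least-squares fit of log10(# genes) against log10(# references).

    Genes with zero references are skipped because log10(0) is undefined.
    Returns (slope, intercept, r_squared).
    """
    # Histogram: number of genes observed for each nonzero reference count.
    histogram = Counter(k for k in reference_counts if k > 0)
    xs = [math.log10(k) for k in histogram]
    ys = [math.log10(n) for n in histogram.values()]
    if len(xs) < 2:
        raise ValueError("need at least two distinct nonzero reference counts")

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


def fraction_at_most(reference_counts, limit):
    """Fraction of genes whose reference count is <= limit."""
    return sum(1 for k in reference_counts if k <= limit) / len(reference_counts)
```

With real data, `fraction_at_most(counts, 0)` and `fraction_at_most(counts, 5)` would correspond to the "zero linked references" and "five or fewer" figures quoted above.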
3
The Long Tail of Knowledge
• Traditional media revolves around the Short Head – a small
number of publishers putting out lots of content
The Short Head: Newspapers, TV/Hollywood, Consumer Reports, Olympics, Encyclopedia Britannica
The Long Tail: Blogs, YouTube, Amazon reviews, American Idol, Wikipedia
Content
• “Web 2.0” media revolves around community generated content –
a huge population of individuals each generating a (relatively)
small amount of content
Users
“Community intelligence”
4
The Long Tail of encyclopedias
• Wiki: “… a website that allows the visitors themselves to easily add, remove,
and otherwise edit and change available content, typically without the need for
registration.”
• Wikipedia: “the free encyclopedia that anyone can edit.”
“An expert-led investigation carried out by Nature … revealed numerous errors in both encyclopaedias, but among 42 entries tested, the difference in accuracy was not particularly great: the average science entry in Wikipedia contained around four inaccuracies; Britannica, about three.”
                          Wikipedia      Britannica Online
Articles                  > 2,000,000    120,000
Words (millions)          > 1,000        55
Average words / article   435            370
http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008
5
Advantages of a Gene Wiki
1) Existing gene portals are great for structured content, but a
wiki is suited for summarizing unstructured content
Entrez Gene
Wikipedia
Unstructured content allows for free-text,
images, diagrams, photos, etc.
6
Advantages of a Gene Wiki
2) Wiki articles enable two-way communication of information,
encouraging contributions and edits from the community.
[Figure: dated article revisions, 2002 to 2006]
Wikipedia is rarely the last place you look, but is often
a good first place for an overview.
Gene “stubs”
• Active MCB community
at Wikipedia had already
developed ~650 gene
articles
• Can we accelerate this
process through stub
creation?
• In total, created 7500
new articles and edited
650 previously existing
articles.
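As a rough illustration of stub creation, the sketch below splits structured gene records into pages to create and pages that already exist, and renders minimal wikitext for a new stub. Everything here is hypothetical: the record fields (`title`, `name`, `symbol`, `entrez_id`), the function names, and the stub layout are assumptions for illustration, not the format actually used for the Gene Wiki.

```python
def render_stub(record):
    """Render minimal wikitext for one gene from a structured record.

    The layout is illustrative only; a real stub would follow the
    conventions agreed with the Wikipedia community.
    """
    lines = [
        f"'''{record['name']}''' ('''{record['symbol']}''') is a gene "
        f"with Entrez Gene ID {record['entrez_id']}.",
        "",
        "== External links ==",
        f"* [https://www.ncbi.nlm.nih.gov/gene/{record['entrez_id']} "
        f"Entrez Gene: {record['symbol']}]",
    ]
    return "\n".join(lines)


def plan_pages(records, existing_titles):
    """Split records into titles to create and existing titles to edit."""
    to_create, to_edit = [], []
    for record in records:
        target = to_edit if record["title"] in existing_titles else to_create
        target.append(record["title"])
    return to_create, to_edit


# Made-up records: the symbols and IDs below are placeholders, not real genes.
example_records = [
    {"title": "Example gene 1", "name": "Example gene 1",
     "symbol": "EXMPL1", "entrez_id": 999999991},
    {"title": "Example gene 2", "name": "Example gene 2",
     "symbol": "EXMPL2", "entrez_id": 999999992},
]
```

Calling `plan_pages(example_records, {"Example gene 2"})` separates the page to be created from the one that would only be edited, mirroring the split between new and previously existing articles.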
7
8
Why Wikipedia?
• Critical mass of articles to which and from which we
could link gene pages
• Critical mass of editors who were experienced in wiki-related
issues (fighting vandalism, copyediting, governance)
• Active group of molecular biologists at the MCB
“WikiProject” (http://en.wikipedia.org/wiki/WP:MCB)
• Alternatives considered
– Home-built wiki
– Citizendium (citizendium.org)
9
Gene wiki usage
Currently have ~9,000 gene pages or stubs at Wikipedia
(650)
(7500)
50% of all edits to gene pages
are to newly-created pages…
Gene Wiki pages are highly
ranked at Google, ensuring
critical mass of users and
editors…
10
Positive feedback loop
[Figure: feedback loop among Gene Wiki page utility, number of editors, and number of readers]
11
25k gene-specific review articles?
Reelin: 33 editors, 221 edits since July 2002
Heparin: 175 editors, 320 edits since June 2003
AMPK: 44 editors, 84 edits since March 2004
RNAi: 232 editors, 708 edits since October 2002
Hyperlinks to related concepts
12
Gene Wiki activity
Steady (and growing?) edit rate over time
[Figure: Gene Wiki Monthly Activity (May 07 - Nov 08); # edits per month from May-07 to Nov-08, y-axis 0 to 12000]
[Figure: Gene Wiki Daily Activity (Oct 17 - Nov 14); # edits per day from 10/17/08 to 11/14/08, y-axis 0 to 160]
13
Gene Wiki article growth
http://manyeyes.alphaworks.ibm.com/manyeyes/visualizations/gene-wiki-top-2500-20081114
Welcome to the semantic web…
“The main concern with plaintext-on-Wikipedia is that it's not an effective way to truly exploit the long tail, since you're going to end up with this massive plaintext disaster that will require human collating (redundant work - just get it right the first time).”
- public-semweb-lifesci mailing list
14
15
Primary emphases
• Providing useful content – scientists will not find or
contribute to a wiki unless it is already useful
• Instant feedback – wikis allow changes to be
effective immediately, without approval or intermediary
(e.g., corrections/additions to NCBI/Ensembl?)
• Emphasis on contributors, not data miners –
emphasize getting data in, not on getting it out, since
complex protocols encourage nonparticipation (e.g.,
MIAME)
• Critical mass – What will differentiate the Gene Wiki
from the many other wiki efforts that are stagnant?
16
Secondary emphases
• Reliability and accuracy – do open and uncurated
data models produce trustworthy content?
• Synergy with existing resources – how can the Gene
Wiki make the growth of traditional annotation more
efficient?
• Enabling semantic queries/structure – how can we
structure unstructured content for data mining?
(Semantic Mediawiki? NLP?)
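One way to picture the Semantic MediaWiki option: its in-text annotations take the form `[[property::value]]`, optionally with display text after a `|`. The sketch below pulls such annotations out of wikitext as (subject, property, value) triples, using the page title as the subject. The property names in the example and the function name are made up for illustration, and this regex covers only the simple annotation form, not the full Semantic MediaWiki syntax.

```python
import re

# Matches [[property::value]] and [[property::value|display text]].
# Plain links such as [[Reelin]] or [[Category:Genes]] do not match,
# because they contain no double colon.
ANNOTATION = re.compile(r"\[\[([^\[\]:|]+)::([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def extract_triples(page_title, wikitext):
    """Return (subject, property, value) triples found in the wikitext."""
    return [
        (page_title, prop.strip(), value.strip())
        for prop, value in ANNOTATION.findall(wikitext)
    ]


# Hypothetical annotated sentence; the property names are illustrative.
example_text = (
    "[[Reelin]] has the gene symbol [[has gene symbol::RELN]] and is a "
    "[[is a::glycoprotein|secreted glycoprotein]]. [[Category:Genes]]"
)
triples = extract_triples("Reelin", example_text)
```

Triples of this shape map naturally onto RDF statements, which is one reason inline annotation is attractive; the open question on the slide is whether contributors would actually write them.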
17
Idealized information flow
[Diagram: information flow with three numbered steps (1-3)]
Nodes: “Long tail” scientific contributions; Wikipedia; Authoritative annotation databases (NCBI, Ensembl, …); Semantic structure
Flows: Create Gene Wiki stubs; Unstructured content from the community; Semantic encoding of free text (how?); Direct semantic annotation by scientists
18
Figure to scale?
“Long tail” scientific
contributions
Wikipedia
NCBI
Ensembl
…
Semantic structure
19
Summary
• Goal: create a resource that complements
existing tools rather than competing with them.
• Primary emphasis will always be on
maximizing community participation.
• How do we structure the unstructured
contributions?
20
Acknowledgements
Serge Batalov
Jason Boyer
Jennifer Floyd
Yue Hu
Jon Huss
Jeff Janes
Camilo Orozco
Steve Su
Julia Turner
Chunlei Wu
David Delano
James Goodale
Phil McClurg
Richard Trager
Faramarz Valafar, SDSU
Tim Vickers, Washington Univ
Michael Cooke
Pete Schultz
Funding: NIGMS, NIH; Novartis Research Foundation