The Gene Wiki, from a BioRDF-naïve perspective W3C / HCLSIG BioRDF Subgroup November 17, 2008
2 Patterns of gene annotation
How do we efficiently annotate the function of the ~25,000 genes in the mammalian genome?
Goal: “Genome-wide functional genomics”
[Figure: log-log plots of number of genes against number of linked references, with power-law fits of the form P(k) ~ k^(-a); one panel is labeled Entrez Gene. Fit values shown: a = -0.6, R squared = 0.894; a = -1.32, R squared = 0.963.]
44% of genes in Entrez Gene have zero linked references. Over 75% have five or fewer linked references.

3 The Long Tail of Knowledge
• Traditional media revolves around the Short Head: a small number of publishers putting out lots of content
• “Web 2.0” media revolves around community-generated content: a huge population of individuals each generating a (relatively) small amount of content
[Figure: axes labeled Content and Users, annotated “Community intelligence”. The Short Head: Newspapers, TV/Hollywood, Consumer Reports, Olympics, Encyclopedia Britannica. The Long Tail: Blogs, YouTube, Amazon reviews, American Idol, Wikipedia.]

4 The Long Tail of encyclopedias
• Wiki: “… a website that allows the visitors themselves to easily add, remove, and otherwise edit and change available content, typically without the need for registration.”
• Wikipedia: “the free encyclopedia that anyone can edit.”
“An expert-led investigation carried out by Nature … revealed numerous errors in both encyclopaedias, but among 42 entries tested, the difference in accuracy was not particularly great: the average science entry in Wikipedia contained around four inaccuracies; Britannica, about three.”
                          Wikipedia      Britannica Online
Articles                  > 2,000,000    120,000
Words (millions)          > 1,000        55
Average words / article   435            370
Source: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008

5 Advantages of a Gene Wiki
1) Existing gene portals are great for structured content, but a wiki is suited for summarizing unstructured content.
[Figure: Entrez Gene and Wikipedia shown side by side]
Unstructured content allows for free text, images, diagrams, photos, etc.

6 Advantages of a Gene Wiki
2) Wiki articles enable two-way communication of information, encouraging contributions and edits from the community.
[Figure: four dated panels, 2002 to 2006]
Wikipedia is rarely the last place you look, but is often a good first place for an overview.

7 Gene “stubs”
• Active MCB community at WP had already developed ~650 gene articles
• Can we accelerate this process through stub creation?
• In total, created 7500 new articles and edited 650 previously existing articles.

8 Why Wikipedia?
• Critical mass of articles to which and from which we could link gene pages
• Critical mass of editors who were experienced in wiki-related issues (fighting vandalism, copyediting, governance)
• Active group of molecular biologists at the MCB “WikiProject” (http://en.wikipedia.org/wiki/WP:MCB)
• Alternatives considered
  – Home-built wiki
  – Citizendium (citizendium.org)

9 Gene wiki usage
Currently ~9000 gene pages or stubs at Wikipedia
[Figure labels: (650), (7500)]
50% of all edits to gene pages are to newly created pages…
Gene Wiki pages are highly ranked at Google, ensuring critical mass of users and editors…

10 Positive feedback loop
[Figure: feedback loop linking Gene Wiki page utility, number of readers, and number of editors; the figure also shows the values 1, 2 and 100, 200]

11 25k gene-specific review articles?
Reelin: 33 editors, 221 edits since July 2002
Heparin: 175 editors, 320 edits since June 2003
AMPK: 44 editors, 84 edits since March 2004
RNAi: 232 editors, 708 edits since October 2002
Hyperlinks to related concepts

12 Gene Wiki activity
Steady (and growing?) edit rate over time
[Figure: Gene Wiki Monthly Activity (May 07 - Nov 08); y-axis: # edits, 0 to 12000]
[Figure: Gene Wiki Daily Activity (Oct 17 - Nov 14); y-axis: # edits, 0 to 160]

13 Gene Wiki article growth
http://manyeyes.alphaworks.ibm.com/manyeyes/visualizations/gene-wiki-top-2500-20081114

14 Welcome to the semantic web…
“The main concern with plaintext-on-Wikipedia is that it's not an effective way to truly exploit the long tail, since you're going to end up with this massive plaintext disaster that will require human collating (redundant work - just get it right the first time).”
- public-semweb-lifesci mailing list

15 Primary emphases
• Providing useful content: scientists will not find or contribute to a wiki unless it is already useful
• Instant feedback: wikis allow changes to take effect immediately, without approval or intermediary (e.g., corrections/additions to NCBI/Ensembl?)
• Emphasis on contributors, not data miners: emphasize getting data in, not getting it out, since complex protocols encourage nonparticipation (e.g., MIAME)
• Critical mass: what will differentiate the Gene Wiki from the many other wiki efforts that are stagnant?

16 Secondary emphases
• Reliability and accuracy: do open and uncurated data models produce trustworthy content?
• Synergy with existing resources: how can the Gene Wiki make the growth of traditional annotation more efficient?
• Enabling semantic queries/structure: how can we structure unstructured content for data mining? (Semantic MediaWiki? NLP?)

17 Idealized information flow
[Figure: flow diagram. Nodes: “Long tail” scientific contributions; Wikipedia; authoritative annotation databases (NCBI, Ensembl, …); semantic structure. Arrow labels: Create Gene Wiki stubs; Unstructured content from the community; Direct semantic annotation by scientists; Semantic encoding of free text (how?). The step numbers 1, 2, and 3 appear on the diagram.]

18 Figure to scale?
[Figure: the same diagram redrawn, with nodes “Long tail” scientific contributions; Wikipedia; NCBI, Ensembl, …; semantic structure]

19 Summary
• Goal: create a resource that complements existing tools rather than competing with them.
• Primary emphasis will always be on maximizing community participation.
• How do we structure the unstructured contributions?

20 Acknowledgements
Serge Batalov, Jason Boyer, Jennifer Floyd, Yue Hu, Jon Huss, Jeff Janes, Camilo Orozco, Steve Su, Julia Turner, Chunlei Wu
David Delano, James Goodale, Phil McClurg, Richard Trager
Faramarz Valafar, SDSU
Tim Vickers, Washington Univ
Michael Cooke, Pete Schultz
Funding: NIGMS, NIH; Novartis Research Foundation
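
Note on the power-law fits from the “Patterns of gene annotation” slide: a relation of the form P(k) ~ k^(-a) appears as a straight line on log-log axes, so its exponent can be estimated with an ordinary least-squares line through the log-transformed values. The sketch below illustrates that idea only. The function name fit_power_law and the data are hypothetical (synthetic counts, not the Entrez Gene numbers), and it is not necessarily how the fits on the slide were produced.

```python
import math


def fit_power_law(ks, counts):
    """Fit a straight line to (log10 k, log10 count).

    Returns (slope, r_squared). Under count ~ k**(-a), the exponent is a = -slope.
    """
    xs = [math.log10(k) for k in ks]
    ys = [math.log10(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Ordinary least-squares slope and intercept.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Coefficient of determination of the line in log-log space.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, 1.0 - ss_res / ss_tot


# Synthetic, hypothetical data: number of genes having k linked references.
ks = list(range(1, 101))
counts = [5000.0 * k ** -1.3 for k in ks]

slope, r_squared = fit_power_law(ks, counts)
print(f"slope = {slope:.2f}, R squared = {r_squared:.3f}")
```

Because the fit is done in log space, values of zero (for example, genes with zero references) cannot be included directly and would need separate handling.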