Knowledge-powered Next Generation Scholar Engines
AAAI 2015: Scholarly Big Data: AI Perspectives, Challenges, and Ideas Workshop
Yang Song, Researcher
Microsoft Research, Redmond
01/25/2015

Topics
• Microsoft Academic and Recent Advances
• Bing Dialog Model and Knowledge Graphs
• Knowledge-powered Academic Search

Microsoft Academic and Recent Advances
[Screenshot labels: search, Embed, Author network, author, Citing Papers, Compare, Domain Trends, organization, Ranking Lists, Call for Papers, conferences]

Microsoft Academic Search data comes from open access repositories, publishers, and web crawls
– Currently ~80M papers (up from ~50M in 2013) across 14 domains
  • New engagement with publisher
  • Better discovery algorithm from Bing index
– Tight integration with Bing and knowledge bases

MAS Attributes
• Papers (~80M)
• Derived entities: Authors (25M), Journals (23K), Conferences (27K), Organizations (770K), Field of Study (53K)
• Enrichment from web: Author homepages, Author Images, Geolocations
• Academic Graph

MAS Workflow
[Workflow diagram labels: Open Repository, Publisher Feeds, Web Crawls, PDF Analyzer, Meta Field Extractor, Data Quality Checker, Conflation Services, Applications & Integrations]

Big Data Challenge
• No Guarantee of Data Quality
  • Garbage characters: “Eric Vivier¶¶”
  • Incomplete metadata:
    • Missing co-author info
    • Missing affiliation info
    • Truncated titles
  • Ill-formatted data:
    • “Alan (1880-1960) Smith”
    • “Alan Smith Stanford University”
  • Duplicated entries
  • Spelling Errors
  • And many more…

Big Data Challenge
• Creates difficulty in conflation
  • Title conflation
  • Institution conflation
  • Venue conflation
  • Author conflation
Efficient Topic-based Unsupervised Name Disambiguation. Song, Huang, Councill, Li, and Giles. JCDL'07.
• Content-based vs. non-content based methods
• Sophisticated ML models, small & complete data
• Manually crafted models, very Big & incomplete data

Big Data Issue
• Creates difficulty in conflation
  • Title conflation
  • Institution conflation
  • Author conflation
  • Etc.
Efficient Topic-based Unsupervised Name Disambiguation. Song, Huang, Councill, Li, and Giles. JCDL'07.
• Content-based vs. non-content based
• Sophisticated ML models, Small data
• Manually crafted models, very Big data
Production (reality)

Bing Dialog Model
SEARCH HIT OR MISS MODEL        | BING DIALOG MODEL
Lexical Indexing and Matching   | Knowledge Computing & Contextual Intent Matching
Push with passive interaction   | Pull & Push with proactive interaction
Relevance of URLs               | Relevance of URLs & Entities
From minimizing time of query/URL matching to efforts to completing tasks

Bing Dialog Model – A stateful feedback system
[Diagram labels: Knowledge & Memory, Previous Inferences (K, I_{t-1}), Expected Behavior, Intent Model, User Behavior (U_t), User Behavior Observer, Inferred Intent (I_t), Interaction Model, Inferred Action (A_t)]
Bayesian Minimum Risk framework
$I_t = \arg\max_{I} P(I \mid U_t, K, I_{t-1})$
$A_t = \arg\min_{A} E[\mathrm{Cost}(A, I_t)]$

Bing Dialog Model implementation
Document-level dialog, Session-level Dialog, Query-level Dialog, Page-level Dialog

Going beyond documents – Entities
Building deep understanding of user tasks

Going beyond documents – Entities
Enabling user task completion

The QED Framework
Understand Queries, Entities and Documents
• Query, Document: $\arg\max E[R(D_1, \dots, D_n) \mid Q]$ (10 blue links optimization)
• Query, Document, Entity: $\arg\max E[R(D_1, \dots, D_n, E_1, \dots, E_m) \mid Q]$ (Docs + Entities joint optimization)

Bing Dialog Measurements
• User Satisfaction
  • Session Success Rate (SSR)
  • Time to Success (TTS)
  • “Modeling action-level satisfaction…” (SIGIR 2014)
  • “Struggling or exploring?
Disambiguate search sessions” (WSDM 2014)
• User Engagement
  • “Context aware web search abandonment prediction” (SIGIR 2014)
  • “Evaluating and predicting user engagement…” (WWW 2013)

Knowledge-powered MAS

Current MAS is a self-contained graph
[Diagram labels: Papers, Authors, Journals, Conferences, Organizations, Field of Study, Author homepages, Author Images, Geolocations, Index]

Academic is now a subset of a larger graph (Map to the Web Presence, Wiki, LinkedIn)
[Diagram labels: Papers, Authors, Journals, Conferences, Organizations, Field of Study, Author homepages, Author Images, Geolocations, Academic Graph]

Knowledge-powered Scholar Engines
• Knowledge bases:
  • Mostly from “high quality” structured data feeds
  • Lots of manual efforts in ingestion, conflation, indexing and serving
  • Errors, holes, outdated information abound
  • Guess the size of Bing KB?
• Marrying knowledge bases with scholar engines:
  • Entity linking between KB and DL
  • Correct and complete missing entity types
  • Detect new entities
  • Cross-link to crowd-sourced, “noisier” knowledge bases
  • Extract knowledge from NL documents

MAS Powered by Bing Dialog Engine
• Can answer very complicated queries
  • Papers by [Author] after [Year] in [FOS]
  • Papers citing [Author] about [FOS]
  • Papers by [Author] and [Author]

Recommendation Services (In Production)
Papers

From Desktop Search to Mobile Personal Assistant (Microsoft Cortana)
[Screenshot labels: Notebook, Settings, Publication about Deep Learning, ICML reminder settings, Counting down to submission deadline, Notify me about, Notification Due, Main Conference and workshop, academic, Submission Deadline, deep learning, Latest publications, ICML, Reminder setting, Geoffrey Hinton, Latest publications, Added or inferred]

Challenges (and Opportunities) Ahead
• Research Community
Engagement
• Disambiguation (KDD Cup 2013)
• Measuring & predicting impact
• Impact within domains
• Scaling out entity linking
• Opportunities to partner on research
• Data? Challenges?

Okay, really: Thank you
Yang Song, Microsoft Research
[email protected]
http://research.microsoft.com/people/yangsong/
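
Illustrative note on the Bayesian Minimum Risk framework from the Bing Dialog Model slide: the two formulas there (infer the most probable intent, then pick the action with the lowest expected cost) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not Bing's implementation. The intent names, behavior names, probabilities, and cost table are all hypothetical; the prior stands in for P(I | K, I_{t-1}), and the expectation in the second formula is assumed to be taken over the posterior distribution of intents.

```python
from typing import Dict


def intent_posterior(prior: Dict[str, float],
                     likelihood: Dict[str, Dict[str, float]],
                     observed: str) -> Dict[str, float]:
    """Bayes rule: P(I | U_t, K, I_{t-1}) is proportional to
    P(U_t | I) * P(I | K, I_{t-1}). The prior plays the role of the
    knowledge/previous-inference term (an assumption of this sketch)."""
    unnormalized = {
        intent: prior[intent] * likelihood[intent].get(observed, 0.0)
        for intent in prior
    }
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observed behavior has zero probability under every intent")
    return {intent: p / total for intent, p in unnormalized.items()}


def infer_intent(posterior: Dict[str, float]) -> str:
    """I_t = argmax over intents of the posterior probability."""
    return max(posterior, key=posterior.get)


def choose_action(posterior: Dict[str, float],
                  cost: Dict[str, Dict[str, float]]) -> str:
    """A_t = argmin over actions of the expected cost, where
    cost[action][intent] is the cost of taking `action` when the
    user's true intent is `intent`."""
    def expected_cost(action: str) -> float:
        return sum(posterior[i] * cost[action][i] for i in posterior)
    return min(cost, key=expected_cost)


# Hypothetical example values, for illustration only.
prior = {"find_paper": 0.6, "find_author": 0.4}
likelihood = {
    "find_paper": {"click_pdf": 0.7, "click_profile": 0.3},
    "find_author": {"click_pdf": 0.2, "click_profile": 0.8},
}
cost = {
    "show_paper_list": {"find_paper": 0.0, "find_author": 1.0},
    "show_author_card": {"find_paper": 1.0, "find_author": 0.0},
}

posterior = intent_posterior(prior, likelihood, "click_profile")
print(infer_intent(posterior))         # find_author
print(choose_action(posterior, cost))  # show_author_card
```

With a 0/1 cost table like the one above, minimizing expected cost selects the action matching the most probable intent; an asymmetric cost table would let a costly mistake outweigh a more probable intent.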