Knowledge-powered Next Generation Scholar Engines AAAI 2015: Scholarly Big Data: AI Perspectives, Challenges, and Ideas Workshop Yang Song, Researcher Microsoft Research, Redmond 01/25/2015


Knowledge-powered Next Generation Scholar Engines
AAAI 2015: Scholarly Big Data: AI Perspectives, Challenges, and Ideas Workshop
Yang Song, Researcher
Microsoft Research, Redmond
01/25/2015
Topics
• Microsoft Academic and Recent Advances
• Bing Dialog Model and Knowledge Graphs
• Knowledge-powered Academic Search
Microsoft Academic and Recent Advances
[Screenshot: Microsoft Academic Search features: search, embed, author network, author, citing papers, compare, domain trends, organization, ranking lists, call for papers, conferences]
Microsoft Academic Search data comes from open access repositories, publishers, and web crawls
– Currently ~80M papers (up from ~50M in 2013) across 14 domains
• New engagement with publishers
• Better discovery algorithm from Bing index
– Tight integration with Bing and knowledge bases
MAS Attributes
• Papers (~80M)
• Derived entities:
– Authors (25M)
– Journals (23K)
– Conferences (27K)
– Organizations (770K)
– Field of Study (53K)
• Enrichment from web:
– Author homepages
– Author Images
– Geolocations
Academic Graph
MAS Workflow
[Diagram: sources (Open Repository, Publisher Feeds, Web Crawls), PDF Analyzer, Meta Field Extractor, Data Quality Checker, Conflation Services, Applications & Integrations]
Big Data Challenge
• No Guarantee of Data Quality
• Garbage characters: “Eric Vivier¶¶”
• Incomplete metadata:
• Missing co-author info
• Missing affiliation info
• truncated titles
• Ill-formatted data:
• “Alan (1880-1960) Smith”
• “Alan Smith Stanford University”
• Duplicated entries
• Spelling Errors
• And many more…
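The name problems listed above can be illustrated with a small cleanup routine. This is only a sketch under stated assumptions, not the MAS Data Quality Checker: the function name `clean_author_name`, the `AFFILIATION_HINTS` keyword list, and the rules themselves are hypothetical and chosen to cover the three example strings from the slide.

```python
import re
import unicodedata

# Hypothetical affiliation keywords, used only for this illustration.
AFFILIATION_HINTS = ("University", "Institute", "College", "Laboratory")


def clean_author_name(raw: str) -> str:
    """Normalize one raw author string with a few hand-written rules."""
    text = unicodedata.normalize("NFKC", raw)
    # Drop parenthesized spans such as life dates: "(1880-1960)".
    text = re.sub(r"\([^)]*\)", " ", text)
    # Keep letters, spaces, and common name punctuation; drop everything else.
    text = "".join(ch for ch in text if ch.isalpha() or ch in " .'-")
    tokens = text.split()
    # Cut the string at the first token that looks like an affiliation.
    for i, tok in enumerate(tokens):
        if tok in AFFILIATION_HINTS:
            # Assumption: the token right before the hint belongs to the
            # affiliation as well (as in "Stanford University").
            tokens = tokens[: max(i - 1, 0)]
            break
    return " ".join(tokens)


print(clean_author_name("Eric Vivier¶¶"))
print(clean_author_name("Alan (1880-1960) Smith"))
print(clean_author_name("Alan Smith Stanford University"))
```

Hand-written rules like these are cheap to run over very large collections, but each rule encodes an assumption that will be wrong for some records (for example, multi-word affiliations).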
Big Data Challenge
• Creates difficulty in conflation
– Title conflation
– Institution conflation
– Venue conflation
– Author conflation
Efficient Topic-based Unsupervised Name Disambiguation. Song, Huang, Councill, Li, and Giles. JCDL'07.
• Content-based vs. non-content based methods
• Sophisticated ML models, small & complete data
• Manually crafted models, very Big & incomplete data
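As an illustration of the "manually crafted, non-content-based" end of that spectrum, title conflation can be sketched as grouping records by a normalized key. This is a minimal sketch and not the method of the JCDL'07 paper cited above; the helper names `title_key` and `conflate_titles` are hypothetical.

```python
import re
from collections import defaultdict


def title_key(title: str) -> str:
    """Lowercase the title and keep only alphanumeric word runs."""
    return " ".join(re.findall(r"[a-z0-9]+", title.lower()))


def conflate_titles(titles):
    """Group raw title strings that share the same normalized key."""
    groups = defaultdict(list)
    for title in titles:
        groups[title_key(title)].append(title)
    return dict(groups)


records = [
    "Efficient Topic-based Unsupervised Name Disambiguation",
    "Efficient topic based unsupervised name disambiguation.",
    "Knowledge-powered Next Generation Scholar Engines",
]
for key, variants in conflate_titles(records).items():
    print(len(variants), key)
```

An exact-key match of this kind tolerates case and punctuation differences but not truncated titles or spelling errors, which is one reason conflation stays difficult on noisy data.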
Big Data Issue
• Creates difficulty in conflation
– Title conflation
– Institution conflation
– Author conflation
– Etc.
Efficient Topic-based Unsupervised Name Disambiguation. Song, Huang, Councill, Li, and Giles. JCDL'07.
• Content-based vs. non-content based
• Sophisticated ML models, Small data
• Manually crafted models, very Big data
Production (reality)
Bing Dialog Model
Search (hit-or-miss model) vs. Bing (dialog model):
• Lexical Indexing and Matching vs. Knowledge Computing & Contextual Intent Matching
• Push with passive interaction vs. Pull & Push with proactive interaction
• Relevance of URLs vs. Relevance of URLs & Entities
From minimizing the time of query/URL matching to minimizing the effort of completing tasks
Bing Dialog Model – A stateful feedback system
[Diagram: feedback loop linking Knowledge & Memory and Previous Inferences (K, I_{t-1}), the User Behavior Observer, User Behavior (U_t), the Intent Model, Inferred Intent (I_t), the Interaction Model, Inferred Action (A_t), and Expected Behavior (+/-)]
Bayesian Minimum Risk framework
$I_t = \arg\max_{I} P(I \mid U_t, K, I_{t-1})$
$A_t = \arg\min_{A} E[\mathrm{Cost}(A, I_t)]$
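The two Bayesian minimum risk equations can be made concrete with a toy example. This is a sketch under stated assumptions: the intent posterior and the cost table are given as plain dictionaries with made-up values, the intent and action names are hypothetical, and the expectation in the second equation is taken over the intent posterior.

```python
def infer_intent(posterior):
    """I_t = argmax_I P(I | U_t, K, I_{t-1}); posterior maps intent -> probability."""
    return max(posterior, key=posterior.get)


def choose_action(posterior, cost):
    """A_t = argmin_A E[Cost(A, I)], with the expectation over the posterior."""
    def expected_cost(action):
        return sum(p * cost[action][intent] for intent, p in posterior.items())
    return min(cost, key=expected_cost)


# Hypothetical posterior over two intents after observing user behavior.
posterior = {"find_paper": 0.6, "find_author": 0.4}

# Hypothetical cost of taking each action when the true intent is given.
cost = {
    "show_paper_list": {"find_paper": 0.0, "find_author": 1.0},
    "show_author_card": {"find_paper": 1.0, "find_author": 0.0},
    "show_both": {"find_paper": 0.3, "find_author": 0.3},
}

print(infer_intent(posterior))         # most probable intent
print(choose_action(posterior, cost))  # action with the lowest expected cost
```

With these numbers the most probable intent is `find_paper`, yet the lowest expected cost belongs to `show_both`, which shows why the action is chosen by risk rather than by the single best intent guess.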
Bing Dialog Model implementation
• Document-level dialog
• Session-level dialog
• Query-level dialog
• Page-level dialog
Going beyond documents – Entities
Building deep understanding of user tasks
Going beyond documents – Entities
Enabling user task completion
The QED Framework
Understand Queries, Entities and Documents
• 10 blue links optimization (Query, Document): $\arg\max E[R(D_1, \dots, D_n) \mid Q]$
• Docs + Entities joint optimization (Query, Document, Entity): $\arg\max E[R(D_1, \dots, D_n, E_1, \dots, E_m) \mid Q]$
Bing Dialog Measurements
• User Satisfaction
– Session Success Rate (SSR)
– Time to Success (TTS)
– “Modeling action-level satisfaction…” (SIGIR 2014)
– “Struggling or exploring? Disambiguate search sessions” (WSDM 2014)
• User Engagement
• “Context aware web search abandonment prediction” (SIGIR 2014)
• “Evaluating and predicting user engagement…” (WWW 2013)
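The two satisfaction metrics named above, SSR and TTS, can be sketched over a toy session log. The slides do not define either metric precisely, so this assumes SSR is the fraction of sessions labeled successful and TTS is the mean time until success among those sessions; the session schema (`success`, `seconds_to_success`) is hypothetical.

```python
from statistics import mean


def session_success_rate(sessions):
    """Fraction of sessions labeled successful, or None for an empty log."""
    if not sessions:
        return None
    return sum(1 for s in sessions if s["success"]) / len(sessions)


def mean_time_to_success(sessions):
    """Mean seconds until success over successful sessions, or None if none."""
    times = [s["seconds_to_success"] for s in sessions if s["success"]]
    return mean(times) if times else None


# Hypothetical session log with made-up values.
log = [
    {"success": True, "seconds_to_success": 10},
    {"success": True, "seconds_to_success": 20},
    {"success": False, "seconds_to_success": None},
    {"success": True, "seconds_to_success": 30},
]

print(session_success_rate(log))
print(mean_time_to_success(log))
```

In practice the success label itself has to be predicted from behavior, which is the subject of the papers cited on the slide.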
Knowledge-powered MAS
Current MAS is a self-contained graph
[Diagram: Papers, Authors, Journals, Conferences, Organizations, Field of Study, Author homepages, Author Images, Geolocations, Index]
Academic is now a subset of a larger graph
(Map to the Web Presence, Wiki, LinkedIn)
[Diagram: Papers, Authors, Journals, Conferences, Organizations, Field of Study, Author homepages, Author Images, Geolocations, Academic Graph]
Knowledge-powered Scholar Engines
• Knowledge bases:
– Mostly from “high quality” structured data feeds
– Lots of manual effort in ingestion, conflation, indexing and serving
– Errors, holes, outdated information abound
– Guess the size of Bing KB?
• Marrying knowledge bases with scholar engines:
– Entity linking between KB and DL
– Correct and complete missing entity types
– Detect new entities
– Cross-link to crowd-sourced, “noisier” knowledge bases
– Extract knowledge from NL documents
MAS Powered by Bing Dialog Engine
• Can answer very complicated queries
• Papers by [Author] after [Year] in [FOS]
• Papers citing [Author] about [FOS]
• Papers by [Author] and [Author]
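To make the first query template concrete, here is a sketch that parses "Papers by [Author] after [Year] in [FOS]" into slots and filters a toy paper list. It is not how the Bing Dialog Engine works; the regular expression, the `Paper` record, and the sample data are all hypothetical, and a real system would resolve the slots to graph entities rather than match strings.

```python
import re
from dataclasses import dataclass


@dataclass
class Paper:
    title: str
    authors: tuple
    year: int
    fos: str  # field of study


# Hypothetical pattern for one template: "Papers by [Author] after [Year] in [FOS]".
PATTERN = re.compile(
    r"^papers by (?P<author>.+?) after (?P<year>\d{4}) in (?P<fos>.+)$",
    re.IGNORECASE,
)


def parse_query(query):
    """Return the template slots as a dict, or None if the query does not match."""
    m = PATTERN.match(query.strip())
    if m is None:
        return None
    return {"author": m["author"], "year": int(m["year"]), "fos": m["fos"]}


def answer(query, papers):
    """Filter papers by the parsed slots using case-insensitive string matching."""
    slots = parse_query(query)
    if slots is None:
        return []
    author = slots["author"].lower()
    fos = slots["fos"].lower()
    return [
        p for p in papers
        if author in (a.lower() for a in p.authors)
        and p.year > slots["year"]
        and p.fos.lower() == fos
    ]


# Made-up records for illustration only.
papers = [
    Paper("P1", ("A. Author", "B. Coauthor"), 2012, "Machine Learning"),
    Paper("P2", ("A. Author",), 2009, "Machine Learning"),
    Paper("P3", ("A. Author",), 2013, "Databases"),
    Paper("P4", ("C. Other",), 2014, "Machine Learning"),
]
hits = answer("Papers by A. Author after 2010 in machine learning", papers)
print([p.title for p in hits])
```

The other two templates on the slide would need a citation edge and a co-author constraint respectively, which is where a graph representation becomes more natural than a flat list.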
Recommendation Services (In Production)
Papers
From Desktop Search to Mobile Personal Assistant (Microsoft Cortana)
[Screenshots: Cortana Notebook settings with academic interests that are added or inferred: "deep learning" and "Geoffrey Hinton" (latest publications; "Publication about Deep Learning"), and "ICML" (reminder settings; notify me about main conference and workshop submission deadlines; counting down to submission deadline)]
Challenges (and Opportunities) Ahead
• Research Community Engagement
• Disambiguation (KDD Cup 2013)
• Measuring & predicting impact
• Impact within domains
• Scaling out entity linking
• Opportunities to partner on research
• Data? Challenges?
Okay, really: thank you
Yang Song, Microsoft Research
http://research.microsoft.com/people/yangsong/