SemTag and Seeker: Bootstrapping the Semantic
Web via Automated Semantic Annotation
Presented by: Hussain Sattuwala
Stephen Dill, Nadav Eiron, David Gibson,
Daniel Gruhl, R. Guha, Anant Jhingran,
Tapas Kanungo, Sridhar Rajagopalan, Andrew
Tomkins, John A. Tomlin, Jason Y. Zien
IBM Almaden Research Center
http://www.almaden.ibm.com/webfountain/resources/semtag.pdf
Outline
Motivation
Goal
SemTag
Seeker
Architecture
Phases
TBD
Results
Methodology
Design
Architecture
Environment
Conclusion
Related and Future work.
Motivation
Natural language processing is the most
significant obstacle to building a
machine-understandable web.
To allow for the Semantic Web to become a
reality we need:
Web-services to maintain & provide metadata.
Annotated documents (OWL, RDF, XML, ...).
Annotations
Current practice of annotation for knowledge
identification, extraction & other applications
is time consuming
needs annotation by experts
is complex
Reduce burden of text annotation for Knowledge
Management
Goal
To perform automated semantic tagging of large
corpora.
To introduce a new disambiguation algorithm to
resolve ambiguities in a natural language corpus.
To introduce the platform which different tagging
applications can share.
SemTag
The goal is to automatically add semantic tags to the
existing HTML body of the web.
Example:
“The Chicago Bulls announced yesterday that Michael Jordan will...”
Becomes:
“The <resource ref="http://tap.stanford.edu/BasketballTeam_Bulls">Chicago Bulls</resource>
announced yesterday that
<resource ref="http://tap.stanford.edu/AthleteJordan_Michael">Michael Jordan</resource>
will...”
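The transformation in the example can be sketched in a few lines of Python. This is only an illustration: the `KB` dictionary and the `annotate` function are hypothetical names, the two URIs are the ones from the example, and the sketch does plain string matching with no disambiguation, so it is not how SemTag itself decides where to tag.

```python
import re

# Hypothetical mini knowledge base: label -> TAP-style resource URI.
# A real system would look labels up in TAP instead of a hard-coded dict.
KB = {
    "Chicago Bulls": "http://tap.stanford.edu/BasketballTeam_Bulls",
    "Michael Jordan": "http://tap.stanford.edu/AthleteJordan_Michael",
}

def annotate(text, kb):
    """Wrap every known label in a <resource ref="..."> element."""
    if not kb:
        return text
    # Try longer labels first so a short label never splits a longer one.
    labels = sorted(kb, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(label) for label in labels))
    return pattern.sub(
        lambda m: '<resource ref="{}">{}</resource>'.format(kb[m.group(0)], m.group(0)),
        text,
    )

tagged = annotate("The Chicago Bulls announced yesterday that Michael Jordan will...", KB)
print(tagged)
```

Every occurrence of a label is tagged here; deciding whether a given occurrence really refers to the knowledge-base entry is the job of the disambiguation step.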
SemTag
Uses TAP KB
TAP is a public, broad, shallow knowledge base.
TAP contains lexical and taxonomical information about
popular objects like music, movies, sports, etc.
Problem: No write access to original document
How do you annotate???
Uses the concept of Label Bureau from PICS
(Platform for Internet Content Selection)
HTTP server that can be queried for annotation
information
Separate store of semantic annotation information
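A rough sketch of the idea of keeping annotations apart from the pages they describe. The class name `LabelBureau`, its methods, and the example URL are all hypothetical, and the lookup is a plain method call here rather than an HTTP query; it does not follow the actual PICS protocol.

```python
from collections import defaultdict

class LabelBureau:
    """Toy stand-in for a label bureau: annotations are stored outside the pages."""

    def __init__(self):
        self._by_url = defaultdict(list)

    def add(self, url, start, end, ref):
        # start/end are character offsets of the label in the page text (assumed layout).
        self._by_url[url].append({"start": start, "end": end, "ref": ref})

    def query(self, url):
        # A real bureau would answer this over HTTP; the page itself is never modified.
        return list(self._by_url.get(url, []))

bureau = LabelBureau()
bureau.add("http://example.com/news.html", 4, 17,
           "http://tap.stanford.edu/BasketballTeam_Bulls")
print(bureau.query("http://example.com/news.html"))
```

Because the store is keyed by page URL, any client that can read the page can also ask for its annotations without needing write access to the original document.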
Example: Annotated Page
SemTag Architecture
[Figure: SemTag architecture. Spotting: retrieve documents, tokenize, find context. Learning: determine distribution of terms. Tagging: disambiguate windows, add to DB. Training data: automatic, manual.]
SemTag Phases
1. Spotting:
Retrieve documents from Seeker.
Tokenize documents.
Find each label that appears in the TAP taxonomy and
save its context (10 words + label + 10 words).
2. Learning:
Scan the representative sample to determine
distribution of terms at each internal node of the
taxonomy.
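The spotting phase above can be sketched as follows. The tokenizer, the function names, and the sample sentence are assumptions made for illustration; the sketch matches labels exactly and case-sensitively, which is simpler than what a web-scale system would need.

```python
import re

def tokenize(text):
    # Simplistic tokenizer (assumption): runs of letters, digits or apostrophes.
    return re.findall(r"[A-Za-z0-9']+", text)

def find_spots(tokens, labels, window=10):
    """Return (label, context) pairs; context is `window` tokens on each side of the label."""
    label_tokens = {label: label.split() for label in labels}
    spots = []
    for i in range(len(tokens)):
        for label, parts in label_tokens.items():
            n = len(parts)
            if tokens[i:i + n] == parts:
                before = tokens[max(0, i - window):i]
                after = tokens[i + n:i + n + window]
                spots.append((label, before + after))
    return spots

tokens = tokenize("The new Jaguar is a fast car with a powerful engine")
spots = find_spots(tokens, ["Jaguar"], window=3)
print(spots)
```

The example uses a window of 3 to keep the output short; the slides describe a window of 10 words on each side.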
SemTag Phases, cont’d
3. Tagging
Disambiguate windows (using TBD).
Add to the database.
Types of ambiguity:
The same label appears at multiple locations in the TAP
ontology.
Some entities have labels that occur in contexts that
have no representative in the taxonomy.
Training Data:
Automatic metadata
Manual metadata
TBD Methodology
Each node has a set of labels.
E.g.: cats, football, cars all contain the label Jaguar.
Each label in the text is stored with a window of
20 words – the context
A spot(l,c) is a label in a context.
Each node has an associated similarity function
mapping a context to a similarity
Higher similarity means the context is more likely to contain a reference to the node.
TBD - Similarity
Generate a 200,000-dimensional vector corresponding
to the context.
TF-IDF scheme
Each entry of the vector is the frequency of the term
occurring at that node divided by the corpus frequency of the
term.
IR Algorithm – Cosine Similarity
Vector product of sparse spot vector and dense node
vector
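A minimal sketch of the weighting and similarity described above, under several assumptions: dictionaries stand in for the 200,000-dimensional vectors, the context vector uses raw term counts, and the node names, term lists, and corpus counts are toy data invented for the example.

```python
import math
from collections import Counter

def node_vector(node_terms, corpus_counts):
    """Weight each term by (frequency at the node) / (frequency in the corpus)."""
    counts = Counter(node_terms)
    return {t: c / corpus_counts[t] for t, c in counts.items() if corpus_counts.get(t)}

def cosine(sparse, dense):
    """Cosine similarity; the dot product only visits the sparse vector's entries."""
    dot = sum(w * dense.get(t, 0.0) for t, w in sparse.items())
    norm_s = math.sqrt(sum(w * w for w in sparse.values()))
    norm_d = math.sqrt(sum(w * w for w in dense.values()))
    if norm_s == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_s * norm_d)

# Toy data (assumption): term counts over the whole corpus and terms seen at two nodes.
corpus_counts = {"engine": 2, "car": 4, "fast": 4, "cat": 2, "jungle": 2, "prey": 2}
cars = node_vector(["engine", "car", "fast", "car"], corpus_counts)
cats = node_vector(["cat", "jungle", "prey", "fast"], corpus_counts)

context = Counter(["fast", "car", "engine"])  # context of a "Jaguar" spot
sim_cars = cosine(context, cars)
sim_cats = cosine(context, cats)
print(sim_cars > sim_cats)
```

With this toy data the context scores higher against the cars node than the cats node, which is the signal the disambiguation step relies on.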
TBD - Algorithm
Some internal nodes are very popular:
Associate a measurement M_u^s of how accurate Sim is
likely to be at a node.
Also M_u^a, how ambiguous the node is overall
(consistency of human judgment)
TBD Algorithm: returns 1 or 0 to indicate
whether a particular context c is on topic for a
node v
82% accuracy on 434 million spots
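One plausible way to combine a similarity verdict with the two per-node measurements is sketched below. This is an assumption made for illustration, not the algorithm as published: the function name `tbd_decision`, the reading of both measurements as probabilities in [0, 1], and the rule of trusting whichever is further from 0.5 are all choices of this sketch.

```python
def tbd_decision(sim_says_on_topic, m_a, m_s):
    """Return 1 if the context is judged on topic for the node, else 0.

    m_a: assumed probability that a spot for this node is on topic
         (estimated from human judgments).
    m_s: assumed probability that the similarity test is right at this node.
    """
    if abs(m_a - 0.5) > abs(m_s - 0.5):
        # The node-level prior is more decisive than the similarity test.
        return 1 if m_a > 0.5 else 0
    if m_s > 0.5:
        # The similarity test is usually right here, so follow it.
        return 1 if sim_says_on_topic else 0
    # The similarity test is wrong more often than right here, so invert it.
    return 0 if sim_says_on_topic else 1

print(tbd_decision(False, m_a=0.95, m_s=0.6))  # prior dominates
print(tbd_decision(True, m_a=0.5, m_s=0.9))    # similarity dominates
```

The point of the sketch is only that the output is binary and depends on both how reliable the similarity function is at a node and how consistently humans judge spots at that node.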
The TBD Algorithm
SemTag Results
Applied to 264 million pages.
Produced 550 million labels.
Final set of 434 million spots with 82% accuracy.
SemTag Methodology
1. Lexicon generation:
Approximately 90 million total words.
1.4 million unique words.
Most frequent 200,000 words.
2. Similarity functions:
Estimated distribution of terms corresponding to the 192
most common TAP nodes to derive f_u.
SemTag Methodology, cont’d
3. Measurement values:
Determined based on 750 relevant human judgments.
4. Full TBD Processing:
Applied to 550 million spots.
5. Evaluation:
Compared TBD results with additional 378 human
judgments.
Seeker
A platform used by SemTag and other increasingly
sophisticated text analytics applications.
Provides scalable, extensible knowledge extraction
from erratic resources.
Erratic resources???
Seeker Design Goals
Composability
Modularity
Extensibility
Scalability
Robustness
Seeker Architecture
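[Figure: Seeker architecture. Infrastructure components (scalability and robustness): web crawling, storage and communication, token indexing, query processing. Analysis agents (modular and extensible): annotators and miners, including SemTag. Components communicate through network-level APIs.]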
Seeker Design
To achieve modularity and extensibility
SOA (service-oriented architecture) was used where
communication among agents is done through a set of
language-independent network-level APIs.
To achieve scalability and robustness
Infrastructure components.
Infrastructure Components
The Data Store
Central repository for all data storage.
Communication medium.
The Indexer
For indexing sequences of tokens.
The Joiner
Query processing component.
Analysis Agents
Annotators
Perform some local processing on each web page and
write the results back to the store in the form of an annotation.
Miners
Perform intermediate processing.
Look at the results of spots on many pages in order to
disambiguate them.
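The difference between the two kinds of agent can be illustrated with a toy sketch. Everything here is hypothetical: the in-memory `store` dictionary, the record layout, the function names, and the example URLs. The miner shown only aggregates annotations across pages; it does not perform disambiguation.

```python
# Hypothetical data store: url -> record with page text and annotations.
store = {
    "http://example.com/a": {"text": "Jaguar unveiled a new car", "annotations": {}},
    "http://example.com/b": {"text": "The Jaguar hunts at night", "annotations": {}},
}

def spot_annotator(record, label="Jaguar"):
    # Annotator: looks at one page only and writes its result back as an annotation.
    record["annotations"]["spot_count"] = record["text"].split().count(label)

def spot_count_miner(store):
    # Miner: reads the annotations left on many pages and combines them.
    return sum(r["annotations"].get("spot_count", 0) for r in store.values())

for record in store.values():
    spot_annotator(record)

total_spots = spot_count_miner(store)
print(total_spots)
```

The agents never call each other directly; they communicate only through what is written to the store, which is what lets new agents be added without changing existing ones.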
Observation
Advantage
Other applications can obtain semantic annotations from a
web-available database.
Uses both human and computer judgments to resolve
ambiguous data in the TBD algorithm.
Disadvantage
The system requires a large amount of storage space to
store data.
Requires a much larger and richer KB to build a web-scale
ontology.
Conclusion
Automatic semantic tagging is essential to
bootstrap the Semantic Web.
It’s possible to achieve good accuracy with simple
disambiguation approaches.
Future Work
Develop more approaches and algorithms for
automated tagging.
Make the annotated data public and offer Seeker as a
public service.
Related Work
Systems built as a result of the Semantic Web
fall into two types:
Ontology creation (semi-automated).
Page annotation.
Examples: Protégé, OntoAnnotate, Annotea, SHOE, …
Some AI approaches were used, but they need a
lot of training. Principal tool: wrapping.
Some used other NL understanding techniques,
for example ALPHA.
Questions?