An Introduction To Categorization
Soam Acharya, PhD
[email protected]
1/15/2003
What is Categorization?
        d1    …    dn
c1      a11   …    a1n
…       …     …    …
cm      am1   …    amn
• {c1 … cm} set of predefined categories
• {d1 … dn} set of candidate documents
• Fill decision matrix with values {0,1}
• Categories are symbolic labels
Uses
• Document organization
• Document filtering
• Word sense disambiguation
• Web
– Internet directories
– Organization of search results
• Clustering
Categorization Techniques
• Knowledge systems
• Machine Learning
Knowledge Systems
• Manually build an expert system
– Makes categorization judgments
– Sequence of rules per category
– If <boolean condition> then category
– If document contains “buena vista home
entertainment” then document category is
“Home Video”
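A minimal sketch of such a rule-based classifier (the rules, categories, and function names here are illustrative, not taken from any real engine):

RULES = [
    # (boolean condition over the document text, category)
    (lambda text: "buena vista home entertainment" in text, "Home Video"),
    (lambda text: "mutual fund" in text and "prospectus" in text, "Finance"),
]

def categorize(text):
    """Return every category whose rule fires on the document text."""
    text = text.lower()
    return [category for condition, category in RULES if condition(text)]

print(categorize("Buena Vista Home Entertainment announced a new DVD line."))
# -> ['Home Video']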
UltraSeek Content Classification Engine
UltraSeek CCE
Knowledge System Issues
• Scalability
– Build
– Tune
• Requires Domain Experts
• Transferability
Machine Learning Approach
• Build a classifier for a category
– Training set
– Hierarchy of categories
• Submit candidate documents for
automatic classification
• Expend effort in building a classifier, not
in knowing the knowledge domain
Machine Learning Process
[Diagram: training set documents and a taxonomy are run through document preprocessing and a training step to produce a classifier; candidate documents from the document DB are then fed to the classifier.]
Training Set
• Initial corpus can be divided into:
– Training set
– Test set
• Role of workflow tools
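A small sketch of dividing an initial corpus into training and test sets (function and parameter names are hypothetical):

import random

def split_corpus(labeled_docs, test_fraction=0.2, seed=42):
    """Shuffle the initial corpus and divide it into a training set and a test set."""
    docs = list(labeled_docs)
    random.Random(seed).shuffle(docs)
    cut = int(len(docs) * (1 - test_fraction))
    return docs[:cut], docs[cut:]   # (training set, test set)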
Document Preprocessing
• Document Conversion:
– Converts file formats (.doc, .ppt, .xls, .pdf
etc) to text
• Tokenizing/Parsing:
– Stemming
– Document vectorization
• Dimension reduction
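A rough sketch of the tokenizing/stemming step (the stop-word list and suffix stripping are toy stand-ins; a real system would use a Porter-style stemmer):

import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}   # tiny illustrative list

def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stop words."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower())
            if t and t not in STOP_WORDS]

def stem(token):
    """Very crude suffix stripping; a real system would use a Porter-style stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

print([stem(t) for t in tokenize("The flights were severely delayed")])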
Document Vectorization
• Convert document text into “bag of
words”
• Each document is a vector of n weighted
terms
Document → term vector:
Federal express   3
Severe            3
Flight            2
Y2000-Q3          1
Mountain          2
Exactly           1
Simple            5
Document Vectorization
• Use tfidf function for term weighting
tfidf(tk, dj) = #(tk, dj) · log( |Tr| / #(tk) )
– #(tk, dj): # of times tk occurs in dj
– |Tr|: cardinality of the training set
– #(tk): # of documents where tk occurs at least once
• tfidf value may be normalized
– All vectors of equal length
– [0,1]
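A sketch of this weighting with length normalization, assuming documents are already token lists (names are hypothetical):

import math
from collections import Counter

def tfidf_vector(doc_tokens, training_docs):
    """tfidf(tk, dj) = #(tk, dj) * log(|Tr| / #(tk)), then normalize to unit length."""
    tf = Counter(doc_tokens)                                   # #(tk, dj)
    df = Counter(t for d in training_docs for t in set(d))     # #(tk)
    n_train = len(training_docs)                               # |Tr|
    weights = {t: c * math.log(n_train / df[t])
               for t, c in tf.items() if df[t]}
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {t: w / norm for t, w in weights.items()}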
Dimension Reduction
• Reduce dimensionality of vector space
• Why?
– Reduce computational complexity
– Address “overfitting” problem
• Overtuning classifier
• How?
– Feature selection
– Feature extraction
Feature Selection
• Also known as “term space reduction”
• Remove “stop” words
• Identify “best” words to be used in
categorizing per topic
– Document frequency of terms
• Keep terms that occur in highest number of
documents
– Other measures
• Chi square
• Information gain
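A sketch of document-frequency-based term space reduction (the keep parameter is an illustrative choice):

from collections import Counter

def select_by_document_frequency(training_docs, keep=1000):
    """Term space reduction: keep the terms occurring in the highest number of documents."""
    df = Counter(term for doc in training_docs for term in set(doc))
    return {term for term, _ in df.most_common(keep)}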
Feature Extraction
• Synthesize new features from existing
features
• Term clustering
– Use clusters/centroids instead of terms
– Co-occurrence and co-absence
• Latent Semantic Indexing
– Compresses vectors into a lower
dimensional space
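A sketch of LSI-style compression via a truncated SVD, assuming a dense term-document matrix and NumPy (not something prescribed by the slides):

import numpy as np

def lsi_project(term_doc_matrix, k):
    """Compress document vectors into a k-dimensional latent space with a truncated SVD.
    Columns of term_doc_matrix are tfidf document vectors."""
    U, S, Vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
    return np.diag(S[:k]) @ Vt[:k, :]    # one k-dimensional column per document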
Creating a Classifier
• Define a function, Categorization Status
Value, CSV, that for a document d:
– CSVi: D -> [0,1]
– Confidence that d belongs in ci
• Boolean
• Probability
• Vector distance
Creating a Classifier
• Define a threshold, thresh, such that if
CSVi(d) > thresh(i) then categorize d
under ci otherwise, don’t
• CSV thresholding
– Fixed value across all categories
– Vary per category
• Optimize via testing
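A sketch of per-category CSV thresholding (the classifier and threshold containers are hypothetical):

def categorize(doc, classifiers, thresholds):
    """Assign doc to every category ci whose CSVi(doc) exceeds thresh(i).
    classifiers maps category -> CSV function; thresholds maps category -> thresh(i)."""
    return [c for c, csv in classifiers.items() if csv(doc) > thresholds[c]]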
Naïve Bayes Classifier
P(ci | dj) = P(ci) · P(dj | ci) / P(dj)
– P(ci | dj): probability of doc dj belonging in category ci
– P(dj | ci): estimated from the training set terms/weights present in dj, used to calculate the probability of dj belonging to ci
Naïve Bayes Classifier
If wkj is binary (0, 1) and pki is short for P(wkx = 1 | ci),
after further derivation, the original equation looks like:
log P(ci | dj) ∝ Σk wkj · log[ pki / (1 − pki) ] + Σk log(1 − pki) + log P(ci)
– The first sum can be used for CSV
– The remaining terms are constants for all docs
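A sketch of a binary-feature Naïve Bayes CSV along these lines (the add-one smoothing is an added assumption, not from the slides):

import math

def train_binary_nb(all_training_docs, category_docs):
    """Estimate pki = P(term present | ci) from binary occurrences, with add-one smoothing."""
    vocab = {t for d in all_training_docs for t in d}
    docs = [set(d) for d in category_docs]
    n = len(docs)
    return {t: (sum(t in d for d in docs) + 1) / (n + 2) for t in vocab}

def naive_bayes_csv(doc_terms, p_ki):
    """Document-dependent part of the derived equation: sum of log(pki / (1 - pki))
    over the terms present in the document (category constants are dropped)."""
    return sum(math.log(p_ki[t] / (1 - p_ki[t])) for t in set(doc_terms) if t in p_ki)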
Naïve Bayes Classifier
• Independence assumption
• Feature selection can be
counterproductive
k-NN Classifier
• Compute closeness between candidate
documents and category documents
CSVi(dj) = Σ over training docs dz of RSV(dj, dz) · Ciz
– RSV(dj, dz): similarity between dj and training set document dz
– Ciz: confidence score indicating whether dz belongs to category ci
k-NN Classifier
• k nearest neighbors
– Find k nearest neighbors from all training
documents and use their categories
– k can also indicate the number of top ranked training documents per category to compare against
• Similarity computation can be:
– Inner product
– Cosine coefficient
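A sketch of the k-NN scoring with a cosine coefficient (the data structures are hypothetical):

import math

def cosine(u, v):
    """Cosine coefficient between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_csv(doc_vec, training_set, category, k=30):
    """CSVi(dj): summed similarity to the k nearest training docs labeled with the category.
    training_set is a list of (term-weight dict, set of categories) pairs."""
    neighbors = sorted(training_set, key=lambda dz: cosine(doc_vec, dz[0]), reverse=True)[:k]
    return sum(cosine(doc_vec, vec) for vec, cats in neighbors if category in cats)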
Support Vector Machines
[Diagram: optimal hyperplane with maximum margin separating two classes of data points]
• “decision surface” that best separates data
points in two classes
• Support vectors are the training docs that
best define hyperplane
Support Vector Machines
• Training process involves finding the
support vectors
• Only care about support vectors in the
training set, not other documents
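A sketch of training a linear SVM text classifier with scikit-learn (the library choice and the toy data are assumptions, not part of the slides):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_texts = ["dvd release from the studio", "quarterly earnings beat estimates"]
train_labels = ["Home Video", "Finance"]              # toy categories

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_texts)             # tfidf document vectors
classifier = LinearSVC().fit(X, train_labels)         # learns the separating hyperplane

print(classifier.predict(vectorizer.transform(["new dvd box set announced"])))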
Neural Networks
• Train a net to learn a mapping from input words to a category
• One neural net per category
– Too expensive
• One network overall
• Perceptron approach without a hidden
layer
• Three layered
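A sketch of the perceptron approach without a hidden layer, one such unit per category (array shapes and names are assumptions):

import numpy as np

def train_perceptron(X, y, epochs=20, lr=0.1):
    """Single-layer perceptron for one category: X is an (n_docs, n_terms) array of
    term weights, y[i] is 1 if doc i belongs to the category, else 0."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0
            w += lr * (yi - pred) * xi     # weights change only on mistakes
            b += lr * (yi - pred)
    return w, b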
Classifier Committees
• Combine multiple classifiers
• Majority voting
• Category specialization
• Mixed results
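A sketch of majority voting over a committee of binary classifiers:

def majority_vote(doc, committee):
    """Assign doc to the category iff more than half of the member classifiers say yes.
    committee is a list of functions doc -> bool."""
    votes = sum(1 for classifier in committee if classifier(doc))
    return votes > len(committee) / 2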
Classification Performance
• Category ranking evaluation
– Recall = (categories found and correct) / (total categories correct)
– Precision = (categories found and correct) / (total categories found)
• Micro and Macro averaging over
categories
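A sketch of these measures with macro and micro averaging over categories (the data structures are hypothetical):

def precision_recall(found, correct):
    """precision = |found ∩ correct| / |found|, recall = |found ∩ correct| / |correct|."""
    hit = len(found & correct)
    return (hit / len(found) if found else 0.0,
            hit / len(correct) if correct else 0.0)

def macro_micro_precision(results):
    """results maps category -> (found set, correct set). Macro averages the per-category
    precisions; micro pools the counts across categories before dividing."""
    per_cat = [precision_recall(f, c)[0] for f, c in results.values()]
    macro = sum(per_cat) / len(per_cat)
    hits = sum(len(f & c) for f, c in results.values())
    found = sum(len(f) for f, _ in results.values())
    micro = hits / found if found else 0.0
    return macro, micro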
Classification Performance
• Hard
• Two studies
– Yiming Yang, 1997
– Yiming Yang and Xin Liu, 1999
• SVM, kNN >> Neural Net > Naïve Bayes
• Performance converges for common
categories (with many training docs)
Computational Bottlenecks
• Quiver
– # of topics
– # of training documents
– # of candidate documents
Categorization and the Internet
• Classification as a service
– Standardizing vocabulary
– Confidentiality
– Performance
• Use of hypertext in categorization
– Augment existing classifiers to take
advantage
Hypertext and Categorization
• An already categorized document links
to documents within same category
• Neighboring documents in a similar
category
• Hierarchical nature of categories
• Metatags
Augmenting Classifiers
• Inject anchor text for a document into
that document
– Treat anchor text as separate terms
• Depends on dataset
• Mixed experimental results
• Links may be noisy
– Ads
– Navigation
Topics and the Web
• Topic distillation
– Analysis of hyperlink graph structure
• Authorities
– popular pages
• Hubs
– Links to authorities
hubs
authorities
Topic Distillation
• Kleinberg’s HITS algorithm
• An initial set of pages: root set
– Use this to create an expanded set
• Weight propagation phase
– Each node: authority score and hub score
– Alternate
• Authority = sum of current hub weights of all nodes
pointing to it
• Hub = sum of the authority scores of all pages it points to
– Normalize node scores and iterate until
convergence
• Output is a set of hubs and authorities
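A sketch of the weight propagation phase as described (the graph representation is an assumption):

def hits(graph, iterations=50):
    """graph maps each page in the expanded set to the list of pages it points to."""
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority = sum of current hub weights of all nodes pointing to it
        auth = {p: sum(hub[q] for q in pages if p in graph.get(q, ())) for p in pages}
        # hub = sum of the authority scores of all pages it points to
        hub = {p: sum(auth[q] for q in graph.get(p, ())) for p in pages}
        # normalize node scores and iterate until convergence
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for p in scores:
                scores[p] /= norm
    return auth, hub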
Conclusion
• Why Classify?
• The Classification Process
• Various Classifiers
• Which ones are better?
• Other applications