Similarity Search for
Web Services
Xin (Luna) Dong, Alon Halevy,
Jayant Madhavan, Ema Nemes, Jun Zhang
University of Washington
Web Service Search
Web services are becoming popular within organizations and on the web.
The growing number of web services raises the
problem of web-service search.
First-generation web-service search engines do
keyword search on web-service descriptions
BindingPoint, Grand Central, Web Service List,
Salcentral, Web Service of the Day, Remote Methods,
etc.
Keyword Search does not Capture the
Underlying Semantics
[Screenshots: a keyword search for "zip" returns 50 results, while a search for "zipcode" returns 18]
Keyword Search does not Accurately
Specify Users’ Information Needs
Users Need to Drill Down to Find the
Desired Operations
Choose a web service
Choose an operation
Enter the input parameters
Results – output
How to Improve Web Service Search?
Offer users more flexibility by providing
similar operations
Base the similarity comparison on the
underlying semantics
1) Provide Similar WS Operations
Op1: GetTemperature
Input: Zip, Authorization
Output: Return
Op2: WeatherFetcher
Input: PostCode
Output: TemperatureF, WindChill, Humidity
Similar operations: select the most appropriate one
2) Provide Operations with Similar Inputs/Outputs
Op1: GetTemperature
Input: Zip, Authorization
Output: Return
Op2: WeatherFetcher
Input: PostCode
Output: TemperatureF, WindChill, Humidity
Op3: LocalTimeByZipcode
Input: Zipcode
Output: LocalTimeByZipCodeResult
Op4: ZipCodeToCityState
Input: ZipCode
Output: City, State
Similar inputs: aggregate the results of the operations
3) Provide Composable WS Operations
Op1: GetTemperature
Input: Zip, Authorization
Output: Return
Op2: WeatherFetcher
Input: PostCode
Output: TemperatureF, WindChill, Humidity
Op3: LocalTimeByZipcode
Input: Zipcode
Output: LocalTimeByZipCodeResult
Op4: ZipCodeToCityState
Input: ZipCode
Output: City, State
Op5: CityStateToZipCode
Input: City, State
Output: ZipCode
The input of Op2 is similar to the output of Op5: compose web-service operations
Searching with Woogle
Similar Operations, Inputs, Outputs
Composable with Input, Output
Searching with Woogle
A sample list of similar operations
Jump from operation to operation
Elementary Problems
Two elementary problems:
Operation matching: Given a web-service operation,
return a list of similar operations
Input/output matching: Given the input/output of a
web-service operation, return a list of web-service
operations with similar inputs/outputs
Goal:
High recall: Return potentially similar operations
Good ranking: Rank closer operations higher
Can We Apply Previous Work?
Software component matching
Require knowledge of the implementation – we only know the interface
Schema matching
Similarity at different granularities – web services are more loosely related
Text document matching
TF/IDF: term-frequency analysis (e.g. Google)
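For reference, a minimal sketch of the TF/IDF-style matching that text search engines rely on (the function names and toy corpus are illustrative, not from the talk): each document becomes a sparse vector of term frequencies weighted by inverse document frequency, compared with cosine similarity.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF/IDF weight vectors for a small corpus of token lists."""
    n = len(docs)
    # document frequency: in how many docs each term appears
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

With only a handful of words per web-service description, these vectors are far too sparse to be reliable, which is exactly the point the next slides make.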
Why Text Matching Does not Apply?
Web page: often long text
Web service: very brief description
Lack of information
Web Services Have Very Brief
Descriptions
Why Text Matching Does not Apply?
Web page: often long text
Web service: very brief description
Lack of information
Web page: mainly plain text
Web service: more complex structure
Finding term frequency is not enough
Operations Have More Complex Structures
Op1: GetTemperature
Input: Zip, Authorization
Output: Return
Op2: WeatherFetcher
Input: PostCode
Output: TemperatureF, WindChill, Humidity
Op3: LocalTimeByZipcode
Input: Zipcode
Output: LocalTimeByZipCodeResult
Op4: ZipCodeToCityState
Input: ZipCode
Output: City, State
Op5: CityStateToZipCode
Input: City, State
Output: ZipCode
Similar use of words, but opposite functionality (e.g. Op4 vs. Op5)
Our Solution
Part 1: Exploit Structure
[Diagram: each web-service description in the Web Service Corpus is split into the operation name and description, the input parameter names, and the output parameter names; these components are compared separately to compute operation similarity]
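The structure-exploiting idea can be sketched as a weighted combination of per-component similarities. This is only an illustration of the architecture: the weights, the Jaccard stand-in for text similarity, and all function names here are mine, not Woogle's actual choices.

```python
def jaccard(a, b):
    """Set-overlap similarity between two token lists (a simple stand-in
    for whatever text-similarity measure is used per component)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def operation_similarity(op_a, op_b, sim, weights=None):
    """Combine similarities of the three structural components
    (name+description, input parameter names, output parameter names)
    into one operation-similarity score. Weights are illustrative."""
    weights = weights or {"description": 0.4, "input": 0.3, "output": 0.3}
    return sum(w * sim(op_a[part], op_b[part]) for part, w in weights.items())
```

Note that with raw parameter names this still scores GetTemperature against WeatherFetcher as 0, since Zip and PostCode share no tokens; Part 2 of the solution addresses exactly that.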
Why Text Matching Does not Apply?
Web page: often long text
Web service: very brief description
Lack of information
Web page: mainly plain text
Web service: more complex structure
Finding term frequency is not enough
Operation and parameter names are highly varied
Finding word usage patterns is hard
Parameter Names Are Highly Varied
Op1: GetTemperature
Input: Zip, Authorization
Output: Return
Op2: WeatherFetcher
Input: PostCode
Output: TemperatureF, WindChill, Humidity
Op3: LocalTimeByZipcode
Input: Zipcode
Output: LocalTimeByZipCodeResult
Op4: ZipCodeToCityState
Input: ZipCode
Output: City, State
Op5: CityStateToZipCode
Input: City, State
Output: ZipCode
Our Solution
Part 2: Cluster Parameters into Concepts
[Diagram: as in Part 1, but the input and output parameter names are additionally mapped to concepts derived from the Web Service Corpus; operation similarity is computed over both names and concepts]
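A minimal sketch of why mapping parameter names to concepts helps: once Zip, Zipcode, and PostCode map to the same concept, inputs/outputs that use any of them become comparable. The function names and the concept-table format are my illustration, not Woogle's representation.

```python
def to_concepts(params, concept_of):
    """Map each parameter term to its concept id.
    Terms with no known concept map to themselves."""
    return {concept_of.get(t, t) for t in params}

def io_similarity(params_a, params_b, concept_of):
    """Jaccard similarity of two inputs/outputs at the concept level."""
    ca = to_concepts(params_a, concept_of)
    cb = to_concepts(params_b, concept_of)
    return len(ca & cb) / len(ca | cb) if ca | cb else 0.0
```

Passing an empty concept table falls back to comparing raw names, which is exactly the variant the experiments later call ParIO.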
Outline
Overview
Clustering parameter names
Experimental evaluation
Conclusions and ongoing work
Clustering Parameter Names
Heuristic: Parameter terms tend to express the
same concept if they occur together often
Strategy: Cluster parameter terms into concepts
based on their co-occurrences
Given terms p and q, the similarity from p to q:
Sim(p→q) = P(q|p)
Directional: e.g. Sim(zip→code) > Sim(code→zip)
(ZipCode vs. TeamCode, ProxyCode, BarCode, etc.)
Term p is close to q if Sim(p→q) > threshold; e.g. city is close to state.
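The directional similarity can be estimated directly from co-occurrence counts over a corpus of input/output parameter lists, as in this minimal sketch (function name and toy data are mine):

```python
from collections import Counter
from itertools import combinations

def directional_sim(param_lists):
    """Estimate Sim(p -> q) = P(q | p): the fraction of parameter lists
    containing term p that also contain term q."""
    occurs = Counter()   # occurs[p]: number of lists containing p
    both = Counter()     # both[(p, q)]: number of lists containing both
    for terms in param_lists:
        terms = set(terms)
        occurs.update(terms)
        for p, q in combinations(terms, 2):
            both[(p, q)] += 1
            both[(q, p)] += 1
    def sim(p, q):
        return both[(p, q)] / occurs[p] if occurs[p] else 0.0
    return sim
```

On a toy corpus where "code" also appears next to "bar" and "team", Sim(zip→code) comes out higher than Sim(code→zip), matching the slide's example.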
Criteria for an Ideal Clustering
High cohesion and low correlation
cohesion measures the intra-cluster term similarity
correlation measures the inter-cluster term similarity
cohesion/correlation score = avg(cohesion) / avg(correlation)
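The score can be sketched as follows, assuming a pairwise term-similarity function `sim` (function names and the averaging details are my illustration of the slide's definition):

```python
from itertools import combinations

def cohesion(cluster, sim):
    """Average pairwise term similarity inside one cluster
    (taken as 1.0 for singleton clusters)."""
    pairs = [(p, q) for p in cluster for q in cluster if p != q]
    return sum(sim(p, q) for p, q in pairs) / len(pairs) if pairs else 1.0

def correlation(c1, c2, sim):
    """Average term similarity between two different clusters."""
    pairs = [(p, q) for p in c1 for q in c2]
    return sum(sim(p, q) for p, q in pairs) / len(pairs)

def clustering_score(clusters, sim):
    """cohesion/correlation score = avg(cohesion) / avg(correlation)."""
    avg_coh = sum(cohesion(c, sim) for c in clusters) / len(clusters)
    cross = [correlation(a, b, sim) for a, b in combinations(clusters, 2)]
    avg_cor = sum(cross) / len(cross) if cross else 0.0
    return avg_coh / avg_cor if avg_cor else float("inf")
```

A clustering that keeps city and state together and zip apart scores higher than one that mixes them, which is what "high cohesion and low correlation" asks for.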
Clustering Algorithm (I)
Algorithm – a series of refinements of the classic agglomerative clustering
Basic agglomerative clustering: merge clusters I and J if some term i in I is close to some term j in J
Clustering Algorithm (II)
Problem:
{temperature, windchill} + {zip}
=> {temperature, windchill, zip}
Solution:
Cohesion condition: each term in the resulting cluster is close to most (e.g. half) of the other terms in the cluster
Refined algorithm: merge clusters I and J only if the resulting cluster satisfies the cohesion condition
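One merge pass with the cohesion condition can be sketched like this (a simplified illustration under the assumption that `close(p, q)` returns whether two terms are close; the split-before-merge and noise-removal refinements of the later slides are omitted here):

```python
def satisfies_cohesion(cluster, close, fraction=0.5):
    """Cohesion condition: every term must be close to at least
    `fraction` of the other terms in the cluster."""
    for t in cluster:
        others = [u for u in cluster if u != t]
        if others and sum(close(t, u) for u in others) < fraction * len(others):
            return False
    return True

def merge_pass(clusters, close):
    """One refined agglomerative step: merge two clusters only if some
    cross-cluster pair is close AND the merged cluster stays cohesive.
    Returns (clusters, whether a merge happened)."""
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            has_close_pair = any(close(p, q)
                                 for p in clusters[i] for q in clusters[j])
            merged = clusters[i] | clusters[j]
            if has_close_pair and satisfies_cohesion(merged, close):
                rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
                return rest + [merged], True
    return clusters, False
```

In the second test below, a single close pair (humidity, zip) is no longer enough to pull zip into a cohesive weather cluster, avoiding the {temperature, windchill, zip} problem of the basic algorithm.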
Clustering Algorithm (III)
Problem:
{code, zip} + {city, state, street} =>
{code} + {zip, city, state, street}
Solution: split before merge
[Diagram: cluster I is first split into I′ and I−I′ (and J into J′ and J−J′); only the sub-cluster I′ is merged with J, while I−I′ remains a separate cluster]
Clustering Algorithm (IV)
Problem:
{city, state, street} + {zip, code}
=> {city, state, street, zip, code}
Solution:
Noise terms: terms for which most (e.g. half) of the occurrences are not accompanied by any other term in the concept
After a pass of splitting and merging, remove noise terms.
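Noise-term removal can be sketched as follows (function name and toy data are mine; `param_lists` is the corpus of input/output parameter lists):

```python
def remove_noise_terms(cluster, param_lists, fraction=0.5):
    """Drop terms from a concept whose occurrences are mostly NOT
    accompanied by any other term of that concept."""
    kept = set(cluster)
    members = set(cluster)
    for t in cluster:
        occurrences = [set(l) for l in param_lists if t in l]
        accompanied = sum(1 for occ in occurrences if occ & (members - {t}))
        if occurrences and accompanied < fraction * len(occurrences):
            kept.discard(t)
    return kept
```

In the test below, zip mostly occurs on its own or with code, so it gets removed from the address concept, undoing the bad merge from the slide's example.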
Clustering Algorithm (V)
Problems:
The cohesion condition is too strict for large concepts
The terms taken off during splitting lose the chance to
merge with other terms
Solution: Run the algorithm iteratively
do {
refined agglomerative clustering (a series of splittings and mergings);
remove noise terms;
replace each term with its concept;
} while (merges still occur)
Outline
Overview
Clustering parameter names
Experimental evaluation
Conclusions and ongoing work
Experiment Data and Clustering Results
Data set:
790 web services (431 are active)
1574 distinct operations
3148 inputs/outputs
Clustering results:
1599 parameter terms
623 concepts
441 single-term concepts (54 frequent terms and 387
infrequent terms)
182 multi-term concepts (59 concepts with more than 5 terms)
Example Clusters
(temperature, heatindex, icon, chance, precipe, uv, like,
temprature, dew, feel, weather, wind, humid, visible,
pressure, condition, windchill, dewpoint, moonset,
sunrise, moonrise, sunset, heat, precipit, extend,
forecast, china, local, update)
(entere, enter, pitcher, situation, overall, hit, double,
strike, stolen, ball, rb, homerun, triple, caught, steal, pct,
op, slug, player, bat, season, stats, position, experience,
throw, players, draft, experier, birth, modifier)
(state, city)
(zip)
(code)
Measuring Top-K Precision
Benchmark
25 web-service operations
From several domains
With different input/output sizes and description sizes
Manually label whether the top hits are similar
Measure
Top-k precision: precision for the top-k hits
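The measure itself is simple enough to state as code (a sketch; the function name is mine, and `is_similar` stands for the manual labels):

```python
def top_k_precision(ranked_hits, is_similar, k):
    """Fraction of the top-k returned operations that are labeled
    as truly similar to the query operation."""
    top = ranked_hits[:k]
    return sum(1 for h in top if is_similar(h)) / len(top) if top else 0.0
```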
Top-k Precision for Operation Matching
[Chart: top-k precision for Woogle, for a variant that ignores structure, and for text matching on descriptions]
Top-k Precision for Input/output Matching
Measuring Precision and Recall
Benchmark:
8 web-service operations and 15 inputs/outputs
From 6 domains
With different popularity
Inputs/outputs convey different numbers of concepts, and
concepts have varied popularity
Manually label similar operations and inputs/outputs.
Measure: R-P (Recall-Precision) curve
Impact of Multiple Sources of Evidence
in Operation Matching
[R-P chart with curves for Func, Comb, ParOnly, and Woogle, annotated with Woogle without clustering, text matching on descriptions, and ignoring structure; precision and recall both range from 0 to 1]
Impact of Parameter Clustering in
Input/output Matching
[R-P chart with curves for ParIO (compare only parameter names), ConIO (compare only concepts), and Woogle; precision and recall both range from 0 to 1]
Conclusions
Defined primitives for web-service search
Algorithms for similarity search on web-service
operations
Exploit structure information
Cluster parameter names into concepts based on
their co-occurrences
Experiments show that the algorithm obtains
high recall and precision.
Ongoing Work I – Template search
on Operations
Input: city, state
Output: weather
Description: forecast in the next nine days
Ongoing Work I – Template search
on Operations
GetWeatherByCityState
Ongoing Work II – Composition
search on Operations
See compositions
Ongoing Work II – Composition
search on Operations
getZIPInfoByAddress
+GetNineDayForecastInfo
Ongoing Work III – Automatic Web
Service Invocation
city=“Seattle” state=“WA”
Similarity Search for
Web Services
@VLDB 2004
Xin (Luna) Dong, Alon Halevy,
Jayant Madhavan, Ema Nemes, Jun Zhang
University of Washington
www.cs.washington.edu/woogle
Ongoing Work I – Template search
on Operations
Italian CAP
Location Information
Holiday Information
Get Weather Forecast