
Similarity Search for
Web Services
Xin (Luna) Dong, Alon Halevy,
Jayant Madhavan, Ema Nemes, Jun Zhang
University of Washington
Web Service Search

- Web services are becoming popular within organizations and on the web.
- The growing number of web services raises the problem of web-service search.
- First-generation web-service search engines do keyword search on web-service descriptions: BindingPoint, Grand Central, Web Service List, Salcentral, Web Service of the Day, Remote Methods, etc.
Keyword Search Does Not Capture the Underlying Semantics

[Screenshots: keyword searches for "zip" and "zipcode" return different result sets (50 vs. 18 hits), even though both refer to the same concept.]
Keyword Search Does Not Accurately Specify Users' Information Needs
Users Need to Drill Down to Find the Desired Operations

1. Choose a web service
2. Choose an operation
3. Enter the input parameters
4. Results – output
How to Improve Web Service Search?

- Offer users more flexibility by providing similar operations
⇒ Base the similarity comparison on the underlying semantics
1) Provide Similar WS Operations

Op1: GetTemperature
  Input: Zip, Authorization
  Output: Return

Op2: WeatherFetcher
  Input: PostCode
  Output: TemperatureF, WindChill, Humidity

Similar operations ⇒ Select the most appropriate one
2) Provide Operations with Similar Inputs/Outputs

Op1: GetTemperature
  Input: Zip, Authorization
  Output: Return

Op2: WeatherFetcher
  Input: PostCode
  Output: TemperatureF, WindChill, Humidity

Op3: LocalTimeByZipcode
  Input: Zipcode
  Output: LocalTimeByZipCodeResult

Op4: ZipCodeToCityState
  Input: ZipCode
  Output: City, State

Similar inputs ⇒ Aggregate the results of the operations
3) Provide Composable WS Operations

Op1: GetTemperature
  Input: Zip, Authorization
  Output: Return

Op2: WeatherFetcher
  Input: PostCode
  Output: TemperatureF, WindChill, Humidity

Op3: LocalTimeByZipcode
  Input: Zipcode
  Output: LocalTimeByZipCodeResult

Op4: ZipCodeToCityState
  Input: ZipCode
  Output: City, State

Op5: CityStateToZipCode
  Input: City, State
  Output: ZipCode

The input of Op2 is similar to the output of Op5 ⇒ Compose web-service operations
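The composition idea can be stated compactly: operation A can feed operation B when A's output is similar to B's input. Below is a minimal Python sketch of that check (not Woogle's implementation); the toy concept table and the threshold are illustrative stand-ins for the concept clustering described later in the talk.

    # Minimal sketch of the composability test; the concept table is a toy
    # stand-in for the clustered parameter concepts described later.
    TOY_CONCEPTS = {"zip": "ZIP", "zipcode": "ZIP", "postcode": "ZIP",
                    "city": "PLACE", "state": "PLACE"}

    def to_concepts(params):
        return {TOY_CONCEPTS.get(p.lower(), p.lower()) for p in params}

    def io_similarity(a, b):
        """Jaccard overlap of concept sets -- a stand-in for the I/O matcher."""
        ca, cb = to_concepts(a), to_concepts(b)
        return len(ca & cb) / len(ca | cb) if ca | cb else 0.0

    def composable(op_a, op_b, threshold=0.5):
        """op_a can feed op_b if op_a's output is similar to op_b's input."""
        return io_similarity(op_a["output"], op_b["input"]) >= threshold

    op5 = {"name": "CityStateToZipCode", "input": ["City", "State"],
           "output": ["ZipCode"]}
    op2 = {"name": "WeatherFetcher", "input": ["PostCode"],
           "output": ["TemperatureF", "WindChill", "Humidity"]}
    print(composable(op5, op2))  # True: ZipCode and PostCode map to one concept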
Searching with Woogle

[Screenshot: the result page lists similar operations, inputs, and outputs, as well as operations composable with a given input or output.]

Searching with Woogle

[Screenshot: a sample list of similar operations; users can jump from operation to operation.]
Elementary Problems

Two elementary problems:
- Operation matching: given a web-service operation, return a list of similar operations.
- Input/output matching: given the input/output of a web-service operation, return a list of web-service operations with similar inputs/outputs.

Goal:
- High recall: return potentially similar operations.
- Good ranking: rank closer operations higher.
Can We Apply Previous Work?

- Software component matching: requires knowledge of the implementation, but we only know the interface.
- Schema matching: similarity at a different granularity; web services are more loosely related.
- Text document matching: TF/IDF term-frequency analysis (e.g., Google).
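For reference, here is a generic sketch of the text-matching baseline the talk contrasts against: TF/IDF weighting with cosine similarity over whole descriptions. The sample descriptions are hypothetical, and this is not the exact baseline implementation used in the experiments.

    # Generic TF/IDF + cosine text matching, the kind of baseline the talk
    # argues is insufficient for web services. Sample descriptions are made up.
    import math
    from collections import Counter

    def tfidf_vectors(docs):
        tokenized = [d.lower().split() for d in docs]
        df = Counter(t for toks in tokenized for t in set(toks))
        n = len(docs)
        return [{t: tf * math.log(n / df[t]) for t, tf in Counter(toks).items()}
                for toks in tokenized]

    def cosine(u, v):
        dot = sum(w * v[t] for t, w in u.items() if t in v)
        norm = math.sqrt(sum(w * w for w in u.values())) * \
               math.sqrt(sum(w * w for w in v.values()))
        return dot / norm if norm else 0.0

    docs = ["get temperature by zip code",
            "fetch weather forecast by post code",
            "convert zip code to city and state"]
    v = tfidf_vectors(docs)
    # Shared surface words ("zip") make the zip-to-city conversion look closer
    # to the temperature operation than the actual weather operation does.
    print(cosine(v[0], v[1]), cosine(v[0], v[2]))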
Why Does Text Matching Not Apply?

- Web page: often long text; web service: very brief description ⇒ lack of information

Web Services Have Very Brief Descriptions
Why Does Text Matching Not Apply?

- Web page: often long text; web service: very brief description ⇒ lack of information
- Web page: mainly plain text; web service: more complex structure ⇒ finding term frequency is not enough
Operations Have More Complex Structures

Op1: GetTemperature
  Input: Zip, Authorization
  Output: Return

Op2: WeatherFetcher
  Input: PostCode
  Output: TemperatureF, WindChill, Humidity

Op3: LocalTimeByZipcode
  Input: Zipcode
  Output: LocalTimeByZipCodeResult

Op4: ZipCodeToCityState
  Input: ZipCode
  Output: City, State

Op5: CityStateToZipCode
  Input: City, State
  Output: ZipCode

Similar use of words, but opposite functionality
Our Solution, Part 1: Exploit Structure

[Diagram: a web-service description from the web-service corpus is split into the operation name and description, the input parameter names, and the output parameter names; these sources of evidence are combined into an operation similarity score.]
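A minimal sketch of what "exploit structure" amounts to: score each part of the interface separately and combine the scores. The component similarity functions are left abstract, and the weights below are illustrative placeholders, not values from the paper.

    # Combine similarity evidence from the operation description, the input
    # parameter names, and the output parameter names. Weights are made up.
    def operation_similarity(op_a, op_b, sim_text, sim_params,
                             weights=(0.4, 0.3, 0.3)):
        w_desc, w_in, w_out = weights
        return (w_desc * sim_text(op_a["description"], op_b["description"])
                + w_in * sim_params(op_a["input"], op_b["input"])
                + w_out * sim_params(op_a["output"], op_b["output"]))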
Why Does Text Matching Not Apply?

- Web page: often long text; web service: very brief description ⇒ lack of information
- Web page: mainly plain text; web service: more complex structure ⇒ finding term frequency is not enough
- Operation and parameter names are highly varied ⇒ finding word usage patterns is hard
Parameter Names Are Highly Varied

Op1: GetTemperature
  Input: Zip, Authorization
  Output: Return

Op2: WeatherFetcher
  Input: PostCode
  Output: TemperatureF, WindChill, Humidity

Op3: LocalTimeByZipcode
  Input: Zipcode
  Output: LocalTimeByZipCodeResult

Op4: ZipCodeToCityState
  Input: ZipCode
  Output: City, State

Op5: CityStateToZipCode
  Input: City, State
  Output: ZipCode

(Zip, PostCode, Zipcode, and ZipCode all denote the same underlying concept.)
Our Solution, Part 2: Cluster Parameters into Concepts

[Diagram: as in Part 1, but the input and output parameter names are first mapped to concepts; the operation name and description, the input names and concepts, and the output names and concepts are combined into an operation similarity score.]
Outline

- Overview
- Clustering parameter names
- Experimental evaluation
- Conclusions and ongoing work
Clustering Parameter Names

- Heuristic: parameter terms tend to express the same concept if they occur together often.
- Strategy: cluster parameter terms into concepts based on their co-occurrences.

Given terms p and q, the similarity from p to q:
- Sim(p → q) = P(q | p)
- Directional: e.g. Sim(zip → code) > Sim(code → zip)
  (ZipCode vs. TeamCode, ProxyCode, BarCode, etc.)
- Term p is close to q if Sim(p → q) > threshold; e.g., city is close to state.
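This directional similarity can be estimated directly from co-occurrence counts over the corpus of input/output parameter lists. A small sketch (variable names and the example data are mine):

    # Sim(p -> q) = P(q | p), estimated from how often p and q co-occur in the
    # same input/output parameter list.
    from collections import Counter
    from itertools import permutations

    def cooccurrence_similarity(param_lists):
        occurs = Counter()      # occurrences of each term
        together = Counter()    # ordered co-occurrences of term pairs
        for terms in param_lists:
            terms = set(terms)
            occurs.update(terms)
            together.update(permutations(terms, 2))
        def sim(p, q):
            return together[(p, q)] / occurs[p] if occurs[p] else 0.0
        return sim

    # "zip" almost always appears with "code", but "code" also appears with
    # team/proxy/bar, so Sim(zip -> code) > Sim(code -> zip).
    lists = [["zip", "code"], ["zip", "code"], ["team", "code"],
             ["proxy", "code"], ["bar", "code"]]
    sim = cooccurrence_similarity(lists)
    print(sim("zip", "code"), sim("code", "zip"))  # 1.0 vs 0.4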
Criteria for an Ideal Clustering

High cohesion and low correlation:
- Cohesion measures the intra-cluster term similarity.
- Correlation measures the inter-cluster term similarity.

cohesion/correlation score = avg(cohesion) / avg(correlation)
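One way to compute this score, reusing the directional term similarity sketched above; exactly how the averages are taken follows my reading of the slide and may differ from the paper in detail.

    # cohesion/correlation score = avg(cohesion) / avg(correlation)
    from itertools import combinations

    def cohesion(cluster, sim):
        pairs = [(p, q) for p in cluster for q in cluster if p != q]
        return sum(sim(p, q) for p, q in pairs) / len(pairs) if pairs else 1.0

    def correlation(c1, c2, sim):
        pairs = [(p, q) for p in c1 for q in c2]
        return sum(sim(p, q) for p, q in pairs) / len(pairs)

    def clustering_score(clusters, sim):
        avg_coh = sum(cohesion(c, sim) for c in clusters) / len(clusters)
        cross = [correlation(a, b, sim) for a, b in combinations(clusters, 2)]
        avg_cor = sum(cross) / len(cross) if cross else 0.0
        return avg_coh / avg_cor if avg_cor else float("inf")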
Clustering Algorithm (I)

- The algorithm is a series of refinements of classic agglomerative clustering.
- Basic agglomerative clustering: merge clusters I and J if some term i in I is close to some term j in J.
Clustering Algorithm (II)

Problem:
  {temperature, windchill} + {zip} => {temperature, windchill, zip}

Solution:
- Cohesion condition: each term in the result cluster must be close to most (e.g., half) of the other terms in the cluster.
- Refined algorithm: merge clusters I and J only if the result cluster satisfies the cohesion condition (see the sketch below).
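A sketch of the cohesion condition and the refined merge test; clusters are represented as Python sets, `close` is the thresholded similarity from the previous slides, and the one-half fraction follows the example on the slide.

    def satisfies_cohesion(cluster, close, fraction=0.5):
        """Every term must be close to at least `fraction` of the other terms."""
        for p in cluster:
            others = [q for q in cluster if q != p]
            if others and sum(close(p, q) for q in others) < fraction * len(others):
                return False
        return True

    def should_merge(cluster_i, cluster_j, close):
        """Merge only if the merged cluster still satisfies the cohesion condition."""
        return satisfies_cohesion(cluster_i | cluster_j, close)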
Clustering Algorithm (III)

Problem:
  {code, zip} + {city, state, street} => {code} + {zip, city, state, street}

Solution: split before merge
  [Diagram: before merging with J, cluster I is split into a sub-cluster I' and the remainder I−I'; only the part that is cohesive with J is merged, while the other part stays a separate cluster. J may be split into J' and J−J' in the same way.]
Clustering Algorithm (IV)

Problem:
  {city, state, street} + {zip, code} => {city, state, street, zip, code}

Solution:
- Noise terms: terms for which most (e.g., half) of the occurrences are not accompanied by other terms in the concept.
- After a pass of splitting and merging, remove noise terms (see the sketch below).
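A sketch of the noise-term test, again using one half as the cutoff per the slide's example; `param_lists` is the corpus of input/output term lists, and the function names are mine.

    def is_noise(term, concept, param_lists, fraction=0.5):
        """True if most occurrences of `term` have no other term of `concept`."""
        occurrences = [set(terms) for terms in param_lists if term in terms]
        if not occurrences:
            return False
        alone = sum(1 for terms in occurrences
                    if not (terms & (set(concept) - {term})))
        return alone >= fraction * len(occurrences)

    def remove_noise(concept, param_lists):
        return {t for t in concept if not is_noise(t, concept, param_lists)}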
Clustering Algorithm (V)

Problems:
- The cohesion condition is too strict for large concepts.
- The terms taken off during splitting lose the chance to merge with other terms.

Solution: run the algorithm iteratively

  do {
      refined agglomerative clustering (a pass of splitting and merging);
      remove noise terms;
      replace each term with its concept;
  } while (new merges occurred);
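A simplified, runnable rendering of this loop, reusing the satisfies_cohesion and remove_noise sketches from the previous slides. The splitting step and the "replace each term with its concept" step are omitted for brevity, so this shows the control flow rather than the full algorithm.

    def one_merge_pass(clusters, close):
        """Greedy pass: merge the first linked pair whose union stays cohesive."""
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                linked = any(close(p, q) for p in clusters[i] for q in clusters[j])
                union = clusters[i] | clusters[j]
                if linked and satisfies_cohesion(union, close):
                    rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
                    return rest + [union], True
        return clusters, False

    def cluster_parameters(terms, param_lists, close):
        clusters = [{t} for t in terms]           # start from singleton clusters
        changed = True
        while changed:
            changed = False
            merging = True
            while merging:                        # one agglomerative pass
                clusters, merging = one_merge_pass(clusters, close)
                changed = changed or merging
            # after the pass, drop noise terms from each concept
            clusters = [c for c in (remove_noise(c, param_lists) for c in clusters) if c]
        return clusters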
Outline

- Overview
- Clustering parameter names
- Experimental evaluation
- Conclusions and ongoing work
Experiment Data and Clustering Results

Data set:
- 790 web services (431 are active)
- 1574 distinct operations
- 3148 inputs/outputs

Clustering results:
- 1599 parameter terms
- 623 concepts
  - 441 single-term concepts (54 frequent terms and 387 infrequent terms)
  - 182 multi-term concepts (59 concepts with more than 5 terms)
Example Clusters

- (temperature, heatindex, icon, chance, precipe, uv, like, temprature, dew, feel, weather, wind, humid, visible, pressure, condition, windchill, dewpoint, moonset, sunrise, moonrise, sunset, heat, precipit, extend, forecast, china, local, update)
- (entere, enter, pitcher, situation, overall, hit, double, strike, stolen, ball, rb, homerun, triple, caught, steal, pct, op, slug, player, bat, season, stats, position, experience, throw, players, draft, experier, birth, modifier)
- (state, city)
- (zip)
- (code)
Measuring Top-K Precision

Benchmark: 25 web-service operations
- From several domains
- With different input/output sizes and description sizes
- Manually label whether the top hits are similar

Measure: top-k precision, the precision for the top-k hits
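Top-k precision is simply the fraction of the k highest-ranked hits that were labeled similar; a short illustration for completeness (function and variable names are mine).

    def top_k_precision(ranked_hits, labeled_similar, k):
        """Fraction of the top-k returned hits that are labeled as similar."""
        return sum(1 for hit in ranked_hits[:k] if hit in labeled_similar) / k

    print(top_k_precision(["op2", "op7", "op4", "op9"], {"op2", "op4"}, 2))  # 0.5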
Top-k Precision for Operation Matching

[Chart: top-k precision of Woogle compared with ignoring structure and with text matching on descriptions.]

Top-k Precision for Input/output Matching

[Chart: top-k precision for input/output matching.]
Measuring Precision and Recall

Benchmark: 8 web-service operations and 15 inputs/outputs
- From 6 domains
- With different popularity
- Inputs/outputs convey different numbers of concepts, and concepts have varied popularity
- Manually label similar operations and inputs/outputs

Measure: R-P (recall-precision) curve
Impact of Multiple Sources of Evidence in Operation Matching

[Recall-precision curves comparing Woogle with variants labeled Func, Comb, and ParOnly; annotations identify "Woogle without clustering", "text matching on descriptions", and "ignore structure". Axes: recall (x) vs. precision (y).]
Impact of Parameter Clustering in Input/output Matching

[Recall-precision curves comparing Woogle with variants that compare only parameter names (ParIO) and only concepts (ConIO). Axes: recall (x) vs. precision (y).]
Conclusions

- Defined primitives for web-service search
- Algorithms for similarity search on web-service operations
  - Exploit structure information
  - Cluster parameter names into concepts based on their co-occurrences
- Experiments show that the algorithm obtains high recall and precision
Ongoing Work I – Template Search on Operations

[Screenshot: a template query with Input: city, state; Output: weather; Description: "forecast in the next nine days" – returning the operation GetWeatherByCityState.]

Ongoing Work II – Composition Search on Operations

[Screenshot: composition search results, e.g. getZIPInfoByAddress + GetNineDayForecastInfo.]
Ongoing Work III – Automatic Web Service Invocation

[Screenshot: invoking an operation with city="Seattle", state="WA".]
Similarity Search for
Web Services
@VLDB 2004
Xin (Luna) Dong, Alon Halevy,
Jayant Madhavan, Ema Nemes, Jun Zhang
University of Washington
www.cs.washington.edu/woogle
Ongoing Work I – Template Search on Operations

[Screenshot: example operation templates such as Italian CAP, Location Information, Holiday Information, and Get Weather Forecast.]