‘Class Exercise’ III: Application Project Evaluation
Deborah McGuinness and Joanne Luciano
with Peter Fox and Li Ding
CSCI/ITEC-6962-01
Week 11, November 15, 2010
Contents
• Review of reading, questions, comments
• Evaluation
• Summary
• Next week
Semantic Web Methodology and Technology Development Process
• Establish and improve a well-defined methodology/vision for Semantic Technology based application development
• Leverage controlled vocabularies, etc.
[Diagram: iterative development cycle – Use Case; Small Team, mixed skills; Analysis; Develop model/ontology; Adopt Technology Infrastructure; Leverage Technology Approach; Science/Expert Review & Iteration; Rapid Prototype; Use Tools; Evaluation; Open World: Evolve, Iterate, Redesign, Redeploy]
References
• Twidale, Randall, and Bentley (1994), and references therein
• Scriven (1991, 1996)
• Weston, McAlpine, and Bordonaro (1995)
• Worthen, Sanders, and Fitzpatrick (1997)
Inventory
• What categories can you measure?
  • Users
  • Files
  • Databases
  • Catalogs
  • Existing UI capabilities (or lack thereof)
  • Services
  • Ontologies
• The use case development stage is a very good time to capture these elements; do not guess, get them from quantitative sources or from the users/actors (a minimal capture sketch follows).
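As a rough illustration (not part of the original slides), an inventory like this can be recorded so that every number carries the quantitative source it came from; the field names and example values below are placeholders.

```python
# Minimal sketch: record inventory counts together with where each number
# came from, so nothing is guessed. All example values are placeholders.
from dataclasses import dataclass, field
from typing import List

@dataclass
class InventoryItem:
    category: str   # e.g. "Users", "Files", "Databases", "Catalogs",
                    # "UI capabilities", "Services", "Ontologies"
    count: int      # measured value, not an estimate
    source: str     # who or what supplied the number (log, DBA, user/actor)

@dataclass
class UseCaseInventory:
    use_case: str
    items: List[InventoryItem] = field(default_factory=list)

    def add(self, category: str, count: int, source: str) -> None:
        self.items.append(InventoryItem(category, count, source))

# Example usage (placeholder numbers):
inv = UseCaseInventory("data search portal")
inv.add("Users", 500, "registration database")   # placeholder value
inv.add("Ontologies", 3, "ontology repository")  # placeholder value
```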
Metrics
• Things you can measure (numerical)
• Things that are categorical
• Could not do before
• Faster, more complete, fewer mistakes, etc.
• Wider range of users
• Measure or estimate the baseline before you start (see the sketch below)
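A small sketch (not from the original slides) of what "baseline first" can look like in practice: record the pre-deployment values, then compare measured values against them; the metric names and numbers here are illustrative, loosely echoing the input-count and query-to-retrieval figures discussed later.

```python
# Sketch: compare post-deployment measurements against a recorded baseline.
# Metric names and values are illustrative placeholders only.
baseline = {"inputs_per_query": 8, "queries_per_retrieval": 3.0}
measured = {"inputs_per_query": 3, "queries_per_retrieval": 1.2}

for metric, before in baseline.items():
    after = measured[metric]
    change = (after - before) / before * 100
    print(f"{metric}: {before} -> {after} ({change:+.0f}%)")

# Categorical metrics ("could not do before") can simply be listed:
new_capabilities = ["parameter-class search", "automatic plot generation"]
```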
Result / Outcome
• Refer to the use case document
• Outcome (and the value of it) is a combination of data gathering processes, including surveys, interviews, focus groups, document analysis, and observations, that will yield both qualitative and quantitative results.
• Did you meet the goal?
• Just listen… do not defend… if you start to, then: QTIP – quit taking it personally
Example: what we wanted to know about VSTO
• Evaluation questions are used to determine the degree to which the VSTO enhanced search, access, and use of data for scientific and educational needs, and effectively utilized and implemented a template for user-centric utilization of the semantic web methodology.
• VO – appears to be local and integrated, and in the end-users’ language (this is one of the metrics)
Evaluation (Twidale et al.)
• An assessment of the overall effectiveness of
a piece of software, ideally yielding a numeric
measure by which informed cost-benefit
analysis of purchasing decisions can be
made.
• An assessment of the degree to which the
software fulfils its specification in terms of
functionality, speed, size or whatever
measures were pre-specified.
Evaluation
• An assessment of whether the software fulfils the purpose for which it was intended.
• An assessment of whether the ideas embodied in the software have been proved to be superior to an alternative, where that alternative is frequently the traditional solution to the problem addressed.
• An assessment of whether the money allocated to a research project has been productively used, yielding useful generalizable results.
Evaluation
• An assessment of whether the software
proves acceptable to the intended end-users.
• An assessment of whether end-users
continue to use it in their normal work.
• An assessment of where the software fails to
perform as desired or as is now seen to be
desirable.
• An assessment of the relative importance of
the inadequacies of the software.
(Orthogonal) Dimensions of evaluations
• Structured vs. less structured
• Quantitative vs. qualitative
• Summative vs. formative
• Controlled experiments vs. ethnographic observations
• Formal and rigorous vs. informal and opportunistic
http://janus.ucc.nau.edu/edtech/etc667/proposal/evaluation/summative_vs._formative.htm
Formative and Summative
• Evaluation is carried out for two reasons:
  • grading translations = summative evaluation
  • giving feedback = formative evaluation
• “When the cook tastes the soup, that’s formative; when the guests taste the soup, that’s summative.” (Stake)
Formative and Summative
What if questions (qualitative)
What if you…
• could not only use your data and tools but a remote colleague's data and tools?
• understood their assumptions, constraints, etc. and could evaluate applicability?
• knew whose research currently (or in the future) would benefit from your results?
• knew whose results were consistent (or inconsistent) with yours?
Evaluation questions and associated data collection methods
Evaluation questions: To what extent does VSTO’s…
• Activities enhance end-user access to and use of data to advance science and education needs? (Interviews/Focus Group, Surveys, Document Analysis, Observation)
• Activities enable higher levels of semantic capability and interoperability such as explanation, reasoning on rules, and semantic query? (Interviews/Focus Group, Surveys, Document Analysis, Observation)
• Contribute to the development and support of community resources, virtual observatories and data systems and provision of results from diverse observing systems using semantically-enabled technologies? (Interviews/Focus Group, Surveys, Document Analysis, Observation)
Evaluation questions and associated data collection methods
Evaluation questions: To what extent does X’s…
• Template contribute to the reports on modern data frameworks, user interfaces, and science progress achieved? (Interviews/Focus Group, Surveys, Document Analysis, Observation)
• Incorporate user experiences in the redesign and development cycles of the VSTO? (Interviews/Focus Group, Surveys, Document Analysis, Observation)
Evaluation questions and associated data collection methods
(Data collection methods: Interviews/Focus Group, Surveys, Document Analysis, Observation)
• How do VSTO activities affect IHE faculty and staff from participating institutions (e.g., changes to virtual observatories) and data sources, results from diverse observing systems using semantically-enabled technologies, and institutional collaboration activities? (Interviews/Focus Group)
• What factors impede or facilitate progress toward VSTO goals? (Interviews/Focus Group)
• What progress has been made toward sustaining and ‘scaling up’ VSTO activities? (Interviews/Focus Group)
Implementing an evaluation
• Based on our experience with use case development and refinement, community engagement, and ontology vetting, a workshop format (6 to 25 participants, depending on desired outcomes and scope) is a very effective mechanism for making rapid progress.
• The workshops can be part of a larger meeting, stand-alone, or partly virtual (via remote telecommunication).
• We have found (for example, in our data integration work) that domain experts in particular are extremely willing to participate in these workshops.
Implementing
• Let’s take an example: VSTO
• Representative, but does not exercise all semantic web capabilities
VSTO qualitative results
• Decreased input requirements: The previous system required the user to provide 8 pieces of input data to generate a query; our system requires 3. Additionally, the three choices are constrained by value restrictions propagated by the reasoning engine. Thus, we have made the workflow more efficient and reduced errors (note the supportive user comments on the following slides).
VSTO qualitative results
• Syntactic query support: The interface generates only syntactically correct queries. The previous interface allowed users to edit the query directly, thus providing multiple opportunities for syntactic errors in the query formation stage. As one user put it: “I used to do one query, get the data and then alter the URL in a way I thought would get me similar data but I rarely succeeded; now I can quickly re-generate the query for new data and always get what I intended”.
VSTO qualitative results
• Semantic query support: By using background ontologies and a reasoner, our application has the opportunity to expose only query options that will not generate incoherent queries. Additionally, the interface only exposes options, for example date ranges, for which data actually exists (see the sketch below). This semantic support did not exist in the previous system; in fact, we limited functionality in the old interface to minimize the chances of misleading or semantically incorrect query construction.
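A rough sketch of this kind of constraint, not the actual VSTO implementation: the ontology file name, namespace, and property names below are hypothetical stand-ins. The idea is to ask the instance data which date range actually has datasets for the chosen instrument, and offer only that range in the interface.

```python
# Sketch only: the ontology file, namespace, and property names below are
# hypothetical stand-ins, not the actual VSTO vocabulary.
from rdflib import Graph

g = Graph()
g.parse("vsto_sample.owl", format="xml")   # hypothetical local file

# Query the instance data for the date range that actually has data for the
# chosen instrument; the UI then only offers dates inside that range.
q = """
PREFIX vsto: <http://example.org/vsto#>
SELECT (MIN(?start) AS ?earliest) (MAX(?end) AS ?latest)
WHERE {
  ?dataset vsto:fromInstrument vsto:FabryPerotInterferometer ;
           vsto:startDate ?start ;
           vsto:endDate   ?end .
}
"""
for row in g.query(q):
    print("Selectable date range:", row.earliest, "to", row.latest)
```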
VSTO qualitative results
• Semantic query support: Users also gain functionality, i.e., they can now initiate a query by selecting a class of parameter(s). As the query progresses, the sub-classes and/or specific instances of that parameter class become available as the datasets are identified later in the query process (a sketch of this subclass expansion follows).
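A minimal sketch of the subclass expansion described above, assuming rdflib and a hypothetical parameter class; the production system uses a reasoner, whereas this sketch approximates RDFS entailment with a transitive walk over rdfs:subClassOf.

```python
# Sketch: enumerate sub-classes (and their instances) of a chosen parameter
# class so they can be offered as the query narrows. Names are hypothetical.
from rdflib import Graph, Namespace, RDF, RDFS

VSTO = Namespace("http://example.org/vsto#")   # hypothetical namespace
g = Graph()
g.parse("vsto_sample.owl", format="xml")       # hypothetical local file

chosen = VSTO.NeutralTemperature               # class picked by the user

# Walk rdfs:subClassOf transitively (includes the chosen class itself),
# then list the known instances of each sub-class.
for subclass in g.transitive_subjects(RDFS.subClassOf, chosen):
    for instance in g.subjects(RDF.type, subclass):
        print(subclass, "->", instance)
```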
VSTO qualitative results
• Semantic query support: We removed the parameter-initiated search in the previous system because only the parameter instances could be chosen (8 different instances to represent neutral temperature, 18 representations of time, etc.) and it was too easy for the wrong one to be chosen, quickly leading to a dead-end query and a frustrated user. One user with more than 5 years of CEDAR system experience noted: “Ah, at last, I’ve always wanted to be able to search this way and the way you’ve done it makes so much sense”.
VSTO qualitative results
• Semantic integration: Users now depend on the ontologies rather than themselves to know the nuances of the terminologies used in varying data collections. Perhaps more importantly, they can also access information about how data was collected, including the operating modes of the instruments used. “The fact that plots come along with the data query is really nice, and that when I selected the data it comes with the correct time parameter” (new graduate student, ~1 year of use).
VSTO qualitative results
• Semantic integration: The nature of the encoding of time for different instruments means that not only are there 18 different parameter representations, but those parameters are sometimes recorded in the prologue entries of the data records, sometimes in the header of the data entry (i.e. as metadata), and sometimes as entries in the data tables themselves. Users had to remember (and maintain codes to) account for numerous combinations. The semantic mediation now provides the level of sensible data integration required (a toy normalization sketch follows).
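To make the mediation concrete, a toy sketch (not the VSTO ontology itself; the representation labels and converters are invented for illustration) of routing each dataset's declared time representation through a single normalizer instead of per-user ad hoc code:

```python
# Toy sketch: map heterogeneous time representations onto one canonical form.
# The representation labels and converters are illustrative, not VSTO's own.
from datetime import datetime, timedelta, timezone

def from_unix_seconds(value):
    return datetime.fromtimestamp(float(value), tz=timezone.utc)

def from_year_doy_seconds(value):
    year, doy, secs = value            # e.g. (2007, 45, 3600)
    base = datetime(year, 1, 1, tzinfo=timezone.utc)
    return base + timedelta(days=doy - 1, seconds=secs)

# One decoder per representation declared in the (ontology-described) metadata.
TIME_DECODERS = {
    "unix_seconds": from_unix_seconds,
    "year_doy_seconds": from_year_doy_seconds,
}

def normalize(representation, raw_value):
    return TIME_DECODERS[representation](raw_value)

print(normalize("year_doy_seconds", (2007, 45, 3600)))
```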
VSTO qualitative results
• Broader range of potential users: VSTO is usable by people who do not have PhD-level expertise in all of the domain science areas, thus supporting efforts including interdisciplinary research. The user population consists of students (undergraduate, graduate) and non-students (instrument PIs, scientists, data managers, professional research associates).
VSTO quantitative results
• Broader range of potential users: For CEDAR, students: 168, non-students: 337; for MLSO, students: 50, non-students: 250 (the implied student shares are worked below). In addition, 36% and 25% of the users are non-US based (CEDAR, a 57% increase over the last year, and MLSO respectively). The relative percentage of students has increased by ~10% for both groups.
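The student fractions implied by these counts can be checked directly; the values below are taken from the slide above.

```python
# Quick check of the student share implied by the counts on this slide.
cedar_students, cedar_non = 168, 337
mlso_students, mlso_non = 50, 250

for name, s, n in [("CEDAR", cedar_students, cedar_non),
                   ("MLSO", mlso_students, mlso_non)]:
    share = s / (s + n) * 100
    print(f"{name}: {s + n} users, {share:.0f}% students")
# CEDAR: 505 users, 33% students
# MLSO: 300 users, 17% students
```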
Adoption (circa 2007)
• Currently there are on average between 80 and 90 distinct users authenticated via the portal and issuing 400–450 data requests per day, resulting in data access volumes of 100KB to 210MB per request. In the last year, 100 new users have registered, more than four times the number from the previous year. The users registered last year when the new portal was released, and after the primary community workshop at which the new VSTO system was presented. At that meeting, community agreement was given to transfer operations to the new system and move away from the existing one.
Facilitating new projects
• At the community workshop a priority area was identified which involved the accuracy and consistency of temperature measurements determined from instruments like the Fabry-Perot Interferometer. As a result, we have seen a 44% increase in data requests in that area. We increased the granularity in the related portion of the ontology to facilitate this study.
Facilitating new projects
• We focused on improving a user’s ability to find related or supportive data with which to evaluate the neutral temperatures under investigation. We are seeing an increase (10%) in other neutral temperature data accesses, which we believe is a result of this related need.
Informal evaluation
• We conducted an informal user study asking three questions: What do you like about the new searching interface? Are you finding the data you need? What is the single biggest difference? Users were already changing the way they search for and access data. Anecdotal evidence indicated that users are starting to think at the science level of queries, rather than at the former syntactic level.
Informal evaluation
• For example, instead of telling a student to enter a particular instrument and date/time range and see what they get, they are able to explore physical quantities of interest at relevant epochs where these quantities go to extreme values, such as auroral brightness at a time of high solar activity (which leads to spectacular auroral phenomena). This suggested to us some new use cases to support even greater semantic mediation.
Further measuring
• One measure that we hoped to achieve is to have usage by all levels of domain scientist – from the PI to the early-level graduate student. Anecdotal evidence shows this is happening, and self-classification also confirms the distribution. A scientist doing model/observational comparisons noted: “took me two passes now, I get it right away”, “nice to have quarter of the options”, and “I am getting closer to 1 query to 1 data retrieval, that’s nice”.
Focus group
• A one-hour workshop was held at the annual community meeting on the day after the main plenary presentation for VSTO. The workshop was very well attended, with 35 diverse participants (25 were expected), including a number of senior researchers, junior researchers, post-doctoral fellows and students (3 of whom had just started in the field).
• After some self-introductions, eight questions were posed and responses recorded, some by count (yes/no) or comment. Overall responses ranged from 5 to 35 per question.
VSTO quantitative results
• How do you like to search for data? Browse, type a query, visual? Responses: 10; Browse=7, Type=0, Visual=3.
• What other concepts are you interested in using for search, e.g. time of high solar activity, campaign, feature, phenomenon, others? Responses: 5; all of these, no others were suggested.
• Does the interface and its services deliver the functionality, speed, flexibility you require? Responses: 30; Yes=30, No=0.
VSTO quantitative results
• Are you finding the data you need? Responses: 35; Yes=34, No=1.
• How often do you use the interface in your normal work? Responses: 19; Daily=13, Monthly=4, Longer=2.
• Are there places where the interface/services fail to perform as desired? Responses: 5; Yes=1, No=4 (a small tally sketch of these counts follows).
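A few lines of Python (a sketch, using the counts reported on these two slides) turn the raw tallies into response shares:

```python
# Tally sketch using the response counts reported above.
results = {
    "How do you like to search for data?": {"Browse": 7, "Type": 0, "Visual": 3},
    "Are you finding the data you need?": {"Yes": 34, "No": 1},
    "Interface/services fail to perform as desired?": {"Yes": 1, "No": 4},
}

for question, counts in results.items():
    total = sum(counts.values())
    shares = ", ".join(f"{k} {v}/{total} ({v/total:.0%})" for k, v in counts.items())
    print(f"{question} -> {shares}")
```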
Qualitative questions
• What do you like about the new searching interface? Responses: 9.
• What is the single biggest difference? Responses: 8.
• The general answers were as follows:
  • Fewer clicks to data (lots)
  • Auto identification and retrieval of independent variables (lots)
  • Faster (lots)
  • Seems to converge faster (few)
Unsolicited/ unstructured comments
• It makes sense now!
• [I] Like the plotting.
• Finding instruments I never knew about.
• Descriptions are very handy.
• What else can you add?
• How about a python interface [to the services]?
Surprise! New use cases
• The need for a programming/script-level interface, i.e. building on the services interfaces; in Python, Perl, C, Ruby, Tcl, and 3 others (a hypothetical sketch of such a wrapper follows).
• Addition of models alongside observational data, i.e. find data from observations/models that are comparable and/or compatible.
• More services (particularly plotting options, e.g. coordinate transformation, that are hard to add without detailed knowledge of the data).
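Such a script-level interface did not exist at the time; below is a hedged sketch of what a thin Python wrapper over HTTP service endpoints might look like. The endpoint URL, parameter names, and function are invented for illustration and are not VSTO's actual API.

```python
# Hypothetical sketch of a script-level wrapper around the data services.
# The endpoint URL and parameter names are invented; they are not VSTO's API.
import requests

SERVICE_URL = "https://vsto.example.org/services/query"  # hypothetical

def find_datasets(instrument, parameter, start_date, end_date):
    """Return dataset descriptions matching the given constraints."""
    params = {
        "instrument": instrument,
        "parameter": parameter,
        "start": start_date,
        "end": end_date,
    }
    response = requests.get(SERVICE_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json()

# Example call (placeholder values):
# datasets = find_datasets("FabryPerotInterferometer", "NeutralTemperature",
#                          "2007-01-01", "2007-12-31")
```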
Other examples
• CALO – Trust studies
  • Alyssa Glass, Deborah L. McGuinness, Paulo Pinheiro da Silva, and Michael Wolverton. Trustable Task Processing Systems. In Roth-Berghofer, T., and Richter, M.M., editors, KI Journal, Special Issue on Explanation, Künstliche Intelligenz, 2008.
• NIMD – Intelligence Analyst Study
  • Andrew J. Cowell, Deborah L. McGuinness, Carrie F. Varley, and David A. Thurman. Knowledge-Worker Requirements for Next Generation Query Answering and Explanation Systems. In Proceedings of the Workshop on Intelligent User Interfaces for Intelligence Analysis, International Conference on Intelligent User Interfaces (IUI 2006), Sydney, Australia.
Keep in mind
• You need an evaluation plan that can lead to
improvements in what you have built
• You need an evaluation to value what you
have built
• You need an evaluation as part of your
publication (and thesis)
Iterating
• Evolve, iterate, re-design, re-deploy
• Small fixes
• Full team must be briefed on the evaluation results and implications
• Decide what to do about the new use cases, or if the goal is not met
• Determine what knowledge engineering is required and who will do it (often participants in the evaluation may become domain experts in your methodology)
• Determine what new knowledge representation is needed
• Assess need for an architectural re-design
Summary
• Project evaluation has many attributes
• Structured and less-structured
• Really need to be open to all forms
• A good way to start is to get members of your team to do peer evaluation
• This is a professional exercise, treat it that way at all times
• Other possible techniques for moving forward on evolving the design, what to focus upon, priorities, etc.: SWOT, Porter’s 5 forces
Next week
• This week’s assignments:
  • Reading: no reading
• Next class (week 11 – November 22):
  • Team Use Case Implementation
• Office hours this week –
• Questions?