Kirrkirr: A Java-based visualisation tool for XML


Kirrkirr: A Java-based
visualisation tool for XML
dictionaries of Australian
Languages
Kevin Jansz
Department of Computer Science, University of Sydney, Australia
Christopher Manning
Computer Science and Linguistics, Stanford University, USA
Nitin Indurkhya
School of Applied Science, Nanyang Technological University, Singapore
Project Objectives
providing innovative ways for representing a
dictionary, through creative use of the medium of
computers
providing practical educationally useful programs as
a result (at low labor cost)
examining the richness of lexical structure
Initial target: the Warlpiri dictionary.

Talk Outline



The research agendas
Kirrkirr: A Warlpiri dictionary browser
The Lexical Database
– exploiting the strengths of XML
– indexing XML data


User interface and visualization
User studies
Research Program: Lexicon

A language is more than individual words with a
definition
– it is a vast network of associations between words and within
and across the concepts represented by words



The aim of this work is to provide people with a better
understanding of this conceptual map.
Traditional paper dictionaries offer very limited ways
for making such networks visible
On a computer, one can imagine all sorts of ways of
bringing out such relationships
Research: Computational
Lexicography




Dictionaries on computers are now commonplace
But there has been little attempt to utilize the
potential of the new medium
Goal: fun dictionary tools that are effective for
language learning, browsing, and research
Special interest: dictionaries for minority
languages. Here economic, motivational, and user
support reasons all point to an important role for
computers.
MRD Structure



The internal structures of current Machine Readable
Dictionaries (MRDs) usually merely mimic the
structure of the printed form (Boguraev 1990)
Some work, notably WordNet (Miller 1995) has
involved a fundamental rethinking of dictionary
content and organization (in WordNet, organization
via “synsets” which are related via links of part,
subkind, opposite)
But there has been little in the way of software to
make such research truly usable by different
communities of users.
Initial focus
Kirrkirr: a Warlpiri browser





Warlpiri is an Australian Aboriginal language spoken
in the Tanami Desert (NW of Alice Springs)
Rich lexical materials have been collected by
linguists over decades (Ken Hale, MIT, from the 1950s)
resulting in one of the most comprehensive lexical
databases for any Australian language
There is a relatively large community of people
interested in learning their traditional language
Until now, results haven’t been produced in a format
usable by the community (only raw printouts)
Kirrkirr aims to build a computer interface for
browsing the Warlpiri dictionary.
Educational goals




Dictionary structure and usability are often dictated
by professional linguists, while the needs of others
(speakers, semi-speakers, young users, second
language learners) are not met
Aim is to avoid this
A low level of literacy makes an e-dictionary
potentially more useful than a paper edition as it is
less dependent on good knowledge of spelling and
alphabetical order.
Making it fun and easy to use, and providing
multimedia content and the pronunciations of words
is a considerable help as well.
Target user community
Kirrkirr: A Warlpiri dictionary
browser
(Jansz 1998; Jansz, Manning and Indurkhya 1999)




An environment for the interactive exploration of
dictionaries.
Although our current work has just been with Warlpiri,
the design is general (Arrernte coming soon!)
Attempts to more fully utilize graphical interfaces,
hypertext, multimedia, and different ways of indexing
and accessing information
Written in Java, it can either be run over the web
[high bandwidth] or run locally (here Java’s main
advantage is cross-platform support).
Specific goals




An interactive environment that encouraged
exploration: easy and fun to use
Reduction of the dependence on alphabetical order
Catering to the needs of different user groups (kids,
teachers, professionals)
Flexible enough to display appropriate information in
appropriate ways depending on user level
Overview
Kirrkirr provides various modules
– Graph layout of word relationships
– Formatted dictionary entries
– Semantic domain browsing
– A notes facility for ‘jotting in the margin’
– Multimedia: audio, pictures
– Advanced searching interfaces
– others in planning: formatting (XSL) editing, figuration
patterns
These attempt to cater to users with different interests
and competence levels
(Kirrkirr screen shot)
The lexical database


Original materials are stored in an ad hoc format of
markup using backslash codes with some (rather
odd) nesting of structural tags
These were converted to XML using an error-correcting stack-based parser (written in Perl).
– The inconsistency and flexibility of dictionary entries actually
made this a surprisingly difficult task.
– But the parser tries to impose data integrity

Use of XML gives a clear structure to the data, and
makes available many (free) tools
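The conversion step described above can be sketched in Java. This is only an illustration of the error-correcting stack idea, not the original Perl parser: the backslash conventions used here (`\+tag` opens a structural element, `\-tag` closes it, `\tag value` is a leaf field) and the sample data are assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class BackslashToXml {
    // Escape the characters that are special in XML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Convert hypothetical backslash-coded lines to XML, repairing bad nesting.
    static String convert(String input) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // stack of currently open elements
        for (String raw : input.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty()) continue;
            if (line.startsWith("\\+")) {            // open a structural element
                String tag = line.substring(2).trim();
                open.push(tag);
                out.append('<').append(tag).append('>');
            } else if (line.startsWith("\\-")) {     // close a structural element
                String tag = line.substring(2).trim();
                if (!open.contains(tag)) continue;   // stray close: drop it
                while (!open.isEmpty()) {            // close anything left open inside
                    String top = open.pop();
                    out.append("</").append(top).append('>');
                    if (top.equals(tag)) break;
                }
            } else if (line.startsWith("\\")) {      // leaf field: \tag value
                String[] parts = line.substring(1).split("\\s+", 2);
                String value = parts.length > 1 ? parts[1] : "";
                out.append('<').append(parts[0]).append('>')
                   .append(escape(value))
                   .append("</").append(parts[0]).append('>');
            } else {
                out.append(escape(line));            // bare text
            }
        }
        while (!open.isEmpty()) {                    // close whatever is still open
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }
}
```

The stack is what makes the repair possible: a close code that skips an inner element closes the inner one first, and a close code with no matching open is ignored, so the output is always well-formed.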
XML




XML separates the structure of the data from its
presentation
Much of the recent enthusiasm for XML has centered
around representing simple and rigid structures such
as database records
The rich hierarchical and variable structure of
dictionary entries is really more what something like
XML excels at!
Result remains a portable, tangible text file
Alternative: a standard database




The obvious thing for storing a lot of data
Has clear advantages: structure, indexing, query
language, relationships, integrity.
Many people have suggested using a database for
lexical data and some have actually done it (IITLEX,
Austin and Nathan)
But in general lexicographers oppose the rigidity,
and, in practice, standard relational databases are
quite ill-suited to dictionaries
Problems with using a Relational
Database






Dictionary entries vary enormously in structure
Data is fragmented
Dictionaries are only loosely structured
Same element can appear at many levels (dialect,
cross-reference, …)
Database model does not easily accommodate
extensions to the dictionary structure
Lessens portability
Alternative: Object Databases




Dictionary can be viewed as a set of entries (objects)
Object-oriented databases for storage
Problem: retrieval via customized query languages
Problem: off-the-shelf products not widely accepted
– Proprietary storage formats reduce portability
– ObjectStore, Versant, Objectivity the main big vendors
– Restricted API places limits on extensibility
– Generic object browsers not suitable for dictionaries
XML database



Document Object Model widely accepted
XML document can be searched and accessed
XQL: a recent (and evolving) W3C proposal for
querying XML documents
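As a rough illustration of searching an XML dictionary through the DOM, here is a minimal Java sketch. It uses XPath (available in the standard Java library) as a stand-in for XQL, which is a substitution on our part. The element names DICTIONARY, ENTRY, HW and GLOSS are assumptions, not the actual Warlpiri dictionary schema.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class DictionaryQuery {
    // Return the headwords of all entries whose GLOSS equals the given string.
    static List<String> headwordsGlossedAs(String xml, String gloss) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind the search string as an XPath variable so quotes in it are harmless.
        xpath.setXPathVariableResolver(
                name -> "gloss".equals(name.getLocalPart()) ? gloss : null);
        NodeList nodes = (NodeList) xpath.evaluate(
                "/DICTIONARY/ENTRY[GLOSS=$gloss]/HW", doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }
}
```

Note that this approach parses the whole document into memory, which is acceptable for a small sample but not for a large dictionary file.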
XQL - Potential




An alternative to investigate for the future is using a
standard query language – such as XQL – to get
material out of the XML dictionary, rather than using
our ad hoc index.
At the moment not a huge issue since most retrieval
is focussed on components of a particular word
XQL standard not stable yet
Very preliminary implementations from vendors
Extracting information from an
XML document

Build an index of its contents
– Index contains details of what is where (in an XML
document)
– Facilitates quick access to contents

Two steps for extracting information: lookup index,
then lookup XML document
– A good index can considerably speed up the 2nd lookup.
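The two-step lookup can be sketched as follows. This is a minimal Java illustration under stated assumptions, not Kirrkirr's actual index format: the element names ENTRY and HW are hypothetical, the index is held in a map rather than a separate file, and entries are located by a simple pattern scan when the index is built.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class EntryIndex {
    private static final Pattern ENTRY = Pattern.compile("<ENTRY>.*?</ENTRY>", Pattern.DOTALL);
    private static final Pattern HW = Pattern.compile("<HW>(.*?)</HW>");

    private final Path file;
    // headword -> {byte offset of the entry, byte length of the entry}
    private final Map<String, long[]> positions = new HashMap<>();

    // One-off indexing pass. Decoding as ISO-8859-1 maps each byte to one char,
    // so match positions are byte offsets even if the file is UTF-8.
    EntryIndex(Path file) throws IOException {
        this.file = file;
        String text = new String(Files.readAllBytes(file), StandardCharsets.ISO_8859_1);
        Matcher m = ENTRY.matcher(text);
        while (m.find()) {
            Matcher hw = HW.matcher(m.group());
            if (hw.find()) {
                String headword = new String(
                        hw.group(1).getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
                positions.put(headword, new long[] { m.start(), m.end() - m.start() });
            }
        }
    }

    // Step 1: look up the index. Step 2: read and parse only that entry.
    Document lookup(String headword) throws Exception {
        long[] pos = positions.get(headword);
        if (pos == null) return null;
        byte[] buf = new byte[(int) pos[1]];
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(pos[0]);
            raf.readFully(buf);
        }
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(buf));
    }
}
```

Only the requested entry is parsed at lookup time; the cost of scanning the whole file is paid once, when the index is built.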
XML indexing - challenges



Despite the various XML parsers available,
surprisingly little consideration has been given to
making single entries retrievable from the file
Current XML parsers tend to load the entire XML
document (or its parsed tree form) into memory before
the data extraction process begins
This is not practical when parsing significant XML
databases (e.g., the Warlpiri dictionary is approx.
10 MB).
XML Indexing - solutions


The hierarchical structure of XML lends itself to
indexing, as each separate entry in the XML file can
be considered as a separate entity
To make the Warlpiri dictionary usable for Kirrkirr an
ad hoc indexing system was developed
– Uses a slightly modified Ælfred XML parser
– Entries are indexed by headword in a separate index file

The system returns an XML document object
containing the single dictionary entry, facilitating
– processing for related words (Graph layout)
– XSL processing to HTML
XML Indexing - solutions (2)

The use of the XML indexing process considerably
improves efficiency as only requested entries are
parsed, hence conserving time and bandwidth
– Once whole entries are parsed, they are kept temporarily in
a cache

Thus the system uses indexed XML as a middle ground:
the structure and indexing of a relational database, with
the freedom and functionality of XML.
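The caching of parsed entries might look like the following Java sketch. The slides only say that entries are kept temporarily in a cache; the least-recently-used eviction policy, the capacity parameter, and the `loader` function here are assumptions chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

class EntryCache<V> {
    private final Map<String, V> cache;
    private final Function<String, V> loader; // hypothetical: parses one entry on demand
    int loads = 0;                            // how many times the loader was called

    EntryCache(int capacity, Function<String, V> loader) {
        this.loader = loader;
        // accessOrder=true keeps the map ordered from least to most recently used.
        this.cache = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                return size() > capacity;     // evict once the cache is over capacity
            }
        };
    }

    // Return the cached entry, parsing it first if it is not in the cache.
    V get(String headword) {
        V value = cache.get(headword);
        if (value == null) {
            value = loader.apply(headword);
            loads++;
            cache.put(headword, value);
        }
        return value;
    }
}
```

In a browsing session users tend to move back and forth between a few related words, so even a small cache avoids most repeated parsing.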
Kirrkirr’s XML Index Process
(diagram: Index in Memory; Kirrkirr; XML document object)
Visualization of dictionary
information




For dictionaries with simple textual content behind
them, there is little that can be done but an on-line
reflection of a printed page
But we want more than just definitions of words: we
want to know their relationships to other words, and
the patterning in these relationships
In a computational approach, the program can
mediate between the lexical data and the user
The interface can select from and choose how to
present information (according to the user’s
preferences) – in many different ways
Previous work




Current systems present the search-dominated
interface of classic Information Retrieval systems:
you type a word in a search box
Results try to mimic, but are generally inferior to, the
printed version of the dictionary
Good feature: rapid searching
But these systems do little to utilize the captivating
qualities of computers: interactivity, user control and
adaptability (Brown 1985).
Previous work (2)




Current systems are only effective when the user has a
clearly specified information need – even here, we
are ignoring the distinction between information
gained and knowledge sought (Sharpe 1995)
Lack browsing, and chances for incidental or curiosity
driven learning
Lack tangibility and situatedness of paper: ineffective
for getting an idea of a collection
We wish to exploit the essence of hypertext, which is
“click to explore” browsing
Previous work (3)




Little research work (in corpus linguistics,
visualization etc.) on dictionary visualization
WordNet built a rich network of relationships, which
fundamentally departed from the paper dictionary
tradition, and has been used in many computational
projects
However very little has been done in the way of
interfaces that make these relationships visible and
intelligible to users.
Graphical representations seem particularly important
given our target users.
Graph-based visualization



There is a little previous work on graphical
representations of dictionaries
For instance, the visual-thesaurus by plumbdesign
derived from WordNet
But it is also a good demonstration of how chaotic
and confusing graphical interfaces can become.
Perils of visualization
Graph-based visualization
(Jansz 1998; Jansz, Manning and Indurkhya 1999)
Classic graph layout problem
Adapts work by Eades et al. (1998) and Huang et al.
(1998) on visualization and navigation of WWW
document linkages
Uses the spring algorithm. Its big advantage is that it is an
iterative updating algorithm, and so gives easy
interactivity:
– it wiggles and people can play with it.

Clarity and simplicity of graph: Software maintains a set
of focus nodes to prevent overcrowding
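To make the iterative nature of the spring algorithm concrete, here is a minimal Java sketch of one generic force-directed update step. It is not Kirrkirr's layout code: the force constants, the rest length, and the class name are assumptions, and focus-node handling is left out.

```java
class SpringEmbedder {
    static final double REST = 50.0;       // preferred edge length (assumed)
    static final double SPRING = 0.05;     // spring stiffness (assumed)
    static final double REPULSION = 500.0; // strength of node-node repulsion (assumed)

    final double[] x, y;   // node positions
    final int[][] edges;   // each edge is a pair of node indices

    SpringEmbedder(double[] x, double[] y, int[][] edges) {
        this.x = x;
        this.y = y;
        this.edges = edges;
    }

    // One iteration: compute forces on every node, then move each node a little.
    void step() {
        int n = x.length;
        double[] fx = new double[n], fy = new double[n];
        // Every pair of nodes repels, which keeps the picture from collapsing.
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double dx = x[i] - x[j], dy = y[i] - y[j];
                double d = Math.max(Math.hypot(dx, dy), 0.01);
                double f = REPULSION / (d * d);
                fx[i] += f * dx / d; fy[i] += f * dy / d;
                fx[j] -= f * dx / d; fy[j] -= f * dy / d;
            }
        }
        // Each edge acts as a spring pulling its endpoints toward the rest length.
        for (int[] e : edges) {
            int a = e[0], b = e[1];
            double dx = x[b] - x[a], dy = y[b] - y[a];
            double d = Math.max(Math.hypot(dx, dy), 0.01);
            double f = SPRING * (d - REST);
            fx[a] += f * dx / d; fy[a] += f * dy / d;
            fx[b] -= f * dx / d; fy[b] -= f * dy / d;
        }
        for (int i = 0; i < n; i++) {
            x[i] += fx[i];
            y[i] += fy[i];
        }
    }
}
```

Because each call to `step()` only nudges the nodes, the layout can be redrawn between steps, and a node dragged by the user simply becomes the starting point for the next iteration. This is where the "wiggle" comes from.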
Educational advantages




Alphabetical order is important, but a web of words
offers other effective opportunities for learning
A student can opportunistically explore words that are
related in various ways
Important semantic relationships can be understood
Kirrkirr network display
Formatted dictionary entries






Are produced automatically from the XML by using
XSL (via James Clark’s XT)
XSL allows easy modeling of some user preferences.
Most trivially, one can leave out information such as
part of speech, or detailed definitions, which we do by
providing several stylesheets to choose from
This is useful as many users find information
overload quite confusing and demotivating
Can produce bilingual or monolingual dictionary
Opportunities for various output styles, and formats
such as RTF or TeX for printing.
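Choosing between stylesheets can be sketched with the standard Java XSLT API. The slides name James Clark's XT as the processor; this example uses the JDK's built-in `javax.xml.transform` instead, and the two stylesheets and the element names ENTRY, HW, POS and GLOSS are assumptions made for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class EntryFormatter {
    private static final String HEAD =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"html\"/>"
        + "<xsl:template match=\"ENTRY\"><p><b><xsl:value-of select=\"HW\"/></b>";
    private static final String TAIL =
        "<xsl:text> </xsl:text><xsl:value-of select=\"GLOSS\"/></p>"
        + "</xsl:template></xsl:stylesheet>";

    // Full view: headword, part of speech, gloss.
    static final String FULL =
        HEAD + "<xsl:text> </xsl:text><i><xsl:value-of select=\"POS\"/></i>" + TAIL;
    // Brief view: the same entry with the part of speech left out.
    static final String BRIEF = HEAD + TAIL;

    // Apply the chosen stylesheet to one entry and return the HTML.
    static String format(String entryXml, String stylesheet) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(entryXml)), new StreamResult(out));
        return out.toString();
    }
}
```

The entry data is identical in both cases; only the stylesheet differs, which is what lets a user preference switch the amount of detail shown.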
Formatted dictionary entries
Rich typology of link types




The semantically rich types of linkages present in a
dictionary (synonym, antonym, hyponym,
subheadword, variant, coverbs, …) solve one of the
major problems of the web: we have many link types
with a clear semantic interpretation
Use consistent color-coded text and edges to show
these link types
Gives a richer browsing experience
Unlike HTML, you can tell where you are going
before clicking
Browsing



Work (at PARC and elsewhere: Pirolli et al. 1996) has
stressed the role of browsing as well as searching in
information access
It provides a context for learning
We provide browsing in several ways:
– conventional hypertext
• but with rich semantically-interpreted links
• their color-coding matches network edges
– network-based display of words

Other methods being investigated:
– browsing through semantic domains
– deriving terminology sets (words that are used together in
culturally important activities) automatically from text corpora
Other components

Multimedia (currently pictures and audio)
– Can hear pronunciations / see objects
– I’m keen to put in videos of Warlpiri sign language …

Advanced search page
– search various fields, regular expressions, etc.

Notes: one can annotate dictionary entries (to correct
or personalize)
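The advanced search over fields with regular expressions, mentioned above, could be sketched in Java as follows. Representing each entry as a map from field name to text, and the field names HW and GLOSS, are assumptions for the example rather than Kirrkirr's internal data model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class FieldSearch {
    // Return the headwords of entries whose given field matches the regex.
    static List<String> search(List<Map<String, String>> entries, String field, String regex) {
        Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        List<String> hits = new ArrayList<>();
        for (Map<String, String> entry : entries) {
            String value = entry.get(field);
            // find() matches anywhere in the field; entries lacking the field are skipped.
            if (value != null && pattern.matcher(value).find()) {
                hits.add(entry.get("HW"));
            }
        }
        return hits;
    }
}
```

Searching a field other than the headword (for example the gloss) is one way to reduce dependence on spelling and alphabetical order.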
User study
Mim Corris (Yuendumu, Willowra)
Jane Simpson (Lajamanu)
User testing with primary and (lower) secondary
students
Observation of trainee Warlpiri literacy workers
Comments from teachers, other adults etc.
Purely qualitative observational study of dictionary
use. (Doing anything much else would be difficult.)
Initial reactions are very enthusiastic
Could use as a basis for classroom activities (better
with some further development: games and puzzles)
A positive anecdote
“One of the introductory Warlpiri literacy students, who had not
been very interested in the literacy class, spent nearly 3/4 hour
looking at Kirrkirr apparently in absorbed concentration. She
wasn’t especially interested in the sound and picture
possibilities. She moved between words, scrolling along the list,
typing in the search, clicking on the words in the network pane.
She wasn’t even put off when the dictionary definitions stopped
appearing – looking at the networks of words instead. This is
quite unlike her attitude to the backslash coded electronic
dictionary (where she lost interest quickly because of the
difficulty for her of narrowing down searches). After the Kirrkirr
demo she asked if she could have a printed dictionary to take
away with her to use in camp to learn the words. I interpret this
as a desire to learn words in her own time and place.”
Conclusions



Kirrkirr is just a prototype of what one can do to
develop new ways to visualize lexicons
We have addressed the challenge of making
dictionary information usable in the creation of an
application which mediates between well-structured
data and users’ needs for searching/browsing and
presentation
While we have focused our research on Warlpiri, the
system can be easily applied to other languages
Conclusions (cont.)


“... The best future applications of MRDs in
education will be those most able to respond to the
insights and needs of their users” (Kegl 1995)
Kirrkirr can be seen as a step towards the future of e-dictionaries
