Free construction of a Swedish dictionary of synonyms


Free Swedish Word Lists
or Hackers’ BLARK
Viggo Kann
KTH, Stockholm
GSLT meeting January 26, 2008
What is a free language resource?




Anyone can use it in an application
Anyone can study it and modify it
Anyone can take a copy of it
Anyone can improve it, release the improvements to the public, so that the whole community benefits
(based on four freedoms of free software, Richard Stallman)
Strong free software culture
GNU project
FSF – Free Software Foundation
GPL – GNU General Public License
OSI – Open Source Initiative
Linux, TeX, Emacs, GCC, MySQL, PHP, Java, Python, Firefox

First meeting of the Free Swedish Words group at KTH, January 16
11 people from around Sweden
Lars Aronsson: Project Runeberg and Swedish Wikipedia (Wiktionary)
Lars Törnquist and Sven Lange: Swedish thesaurus built on Bring (1930)
Christian Mattson: Lexin dictionaries
Niklas Johansson: Spelling error detection and correction in OpenOffice
Göran Andersson: DSSO – The large Swedish word list
Viggo Kann: Stava, Granskatagger, Synlex, Tvärslå Nordic dictionary
Per Starrbäck, Leif-Jöran Olsson, Tomas Padron-McCarthy, Erik Geijer

Plans for more free words
Swedish synonyms in OpenOffice (Niklas)
Extending DSSO with synonyms, associations etc. (Göran)
Building a free Swedish-English dictionary (Viggo)
Testing Swedish grammar checking in LanguageTool/OpenOffice (Viggo & Niklas)

Typical ways to construct a resource
…if you are a language technologist:
Get funding
Use resources that are free to use for researchers
Hire linguists to do the heavy jobs
…if you are a free software hacker:
Use other free resources
Collect data from lots of people using e.g. a wiki or a web form
Example: Synlex
Construct a Swedish dictionary of synonyms as a list of synonymous pairs
I don’t want to work a lot
I don’t want to pay anyone to work
The resulting list should become free

Ideas
Automatically construct a large set of word pairs that might be synonyms
Use tens of thousands of people, each willing to make a small unpaid contribution, to check the word pairs

More ideas
Use the Lexin online Swedish-English dictionary website, which had 9 million lookups each month (now 25 million)
Users visit Lexin to translate words, and are thus probably motivated to help me
Each time a user makes a lookup, give her the opportunity to decide whether two words are synonyms or not

My plan
1. Construct lots of possible synonyms
2. Sort out bad synonym pairs automatically
3. Ask lots of users if the rest of the pairs are good synonyms
4. Analyze the gradings done by the users and decide which pairs to keep
Step 1: Construct lots of possible synonyms
If we have access to a Swedish-English dictionary SE and an English-Swedish dictionary ES, try to translate each word to English and back again to Swedish
{(w,v) : ∃y : y ∈ SE(w) ∧ v ∈ ES(y)} or {(w,v) : ∃y : y ∈ SE(w) ∧ y ∈ SE(v)}
616 000 word pairs were generated
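
The two set definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for Synlex: the dictionaries are modelled as plain mappings from a word to a set of translations, the miniature SE and ES below are made up, and the choices to skip pairs where w equals v and to treat pairs as unordered are assumptions that the slide does not state.

```python
from collections import defaultdict

def candidate_pairs(se, es):
    """Generate candidate synonym pairs from two toy dictionaries.

    se maps a Swedish word to a set of English translations,
    es maps an English word to a set of Swedish translations.
    Pairs are returned as sorted tuples, so (w, v) and (v, w) coincide.
    """
    pairs = set()

    # Variant 1: translate w to English (y) and back to Swedish (v).
    for w, english in se.items():
        for y in english:
            for v in es.get(y, ()):
                if v != w:  # assumption: a word is not its own synonym
                    pairs.add(tuple(sorted((w, v))))

    # Variant 2: two Swedish words share an English translation y.
    by_english = defaultdict(set)
    for w, english in se.items():
        for y in english:
            by_english[y].add(w)
    for words in by_english.values():
        for w in words:
            for v in words:
                if w < v:
                    pairs.add((w, v))
    return pairs

# Hypothetical miniature dictionaries, for illustration only.
SE = {"klass": {"class", "grade"}, "rang": {"grade", "rank"}, "typ": {"type"}}
ES = {"class": {"klass"}, "grade": {"klass", "årskurs"}, "rank": {"rang"}}

pairs = candidate_pairs(SE, ES)
print(sorted(pairs))
```

Both variants overgenerate on purpose: any shared English translation is enough to propose a pair, which is why the later steps are needed to filter the candidates.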

Step 2: Remove bad synonym pairs automatically
Use RI (Random Indexing) [Kanerva, Kristoferson, Holst 2000] to measure the distance between words represented in a large vector space
Keep pairs that have small enough distance in the vector space
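
As a rough illustration of the idea, the sketch below builds Random Indexing context vectors from a tiny tokenized corpus and keeps the pairs whose vectors are close. Everything concrete here is an assumption made for the example: the dimensionality, the number of nonzero entries, the window size, the use of cosine similarity as the closeness measure, and the threshold. The slides do not say which settings were actually used.

```python
import math
import random
from collections import defaultdict

DIM = 200      # dimensionality of the vector space (assumed value)
NONZERO = 8    # number of +1/-1 entries per index vector (assumed value)
WINDOW = 2     # context words considered on each side (assumed value)

def index_vector(word):
    """Sparse random +1/-1 vector, reproducible because it is seeded by the word."""
    rng = random.Random(word)
    positions = rng.sample(range(DIM), NONZERO)
    return {p: rng.choice((-1, 1)) for p in positions}

def context_vectors(sentences):
    """For each word, add up the index vectors of its neighbouring words."""
    vectors = defaultdict(lambda: [0] * DIM)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)
            for j in range(lo, hi):
                if j != i:
                    for pos, sign in index_vector(tokens[j]).items():
                        vectors[word][pos] += sign
    return vectors

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def keep_close_pairs(pairs, vectors, min_similarity):
    """Keep pairs where both words were seen and their vectors are similar enough."""
    return [(w, v) for w, v in pairs
            if w in vectors and v in vectors
            and cosine(vectors[w], vectors[v]) >= min_similarity]

# Hypothetical two-sentence corpus: "bil" and "bok" occur in identical contexts.
corpus = [["en", "stor", "bil"], ["en", "stor", "bok"]]
vectors = context_vectors(corpus)
kept = keep_close_pairs([("bil", "bok"), ("bil", "saknas")], vectors, 0.99)
print(kept)
```

Words that occur in the same contexts end up with nearly the same vector, so a similarity threshold removes pairs whose words are used very differently, and pairs containing a word never seen in the corpus are dropped as well.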

Step 3: Ask lots of users if the rest of the pairs are good synonyms
When a user has sent a word to the Lexin dictionary he receives the translation followed by a question like:
Are 'spread' and 'lengthen' synonyms?
Answer using a scale from 0 to 5 where 0 means 'I don’t agree' and 5 means 'I do fully agree', or answer 'I don’t know'
Step 4: Analyzing the gradings done by the users
1.2 million gradings were made in less than 2 months
Grading statistics were analyzed on several occasions
Some users sent comments
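
One simple way to analyze such gradings is to compute the mean grade per word pair and keep the pairs above a threshold. The sketch below assumes that 'I don’t know' answers are stored as None and ignored, uses a mean of at least 2 as the cut-off (the figure mentioned on the statistics slide), and adds a hypothetical minimum number of gradings per pair. The sample gradings are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

def mean_gradings(gradings):
    """Map each pair to (mean grade, number of gradings).

    gradings is an iterable of (word1, word2, grade), where grade is an
    integer from 0 to 5, or None for the answer 'I don't know'.
    """
    by_pair = defaultdict(list)
    for w, v, grade in gradings:
        if grade is not None:  # assumption: 'don't know' answers are ignored
            by_pair[(w, v)].append(grade)
    return {pair: (mean(g), len(g)) for pair, g in by_pair.items()}

def accepted_pairs(stats, min_mean=2.0, min_count=3):
    """Keep pairs with a high enough mean and enough gradings (min_count is assumed)."""
    return sorted(pair for pair, (m, n) in stats.items()
                  if m >= min_mean and n >= min_count)

# Invented sample gradings, for illustration only.
gradings = [
    ("klass", "rang", 5), ("klass", "rang", 4), ("klass", "rang", 5),
    ("klass", "uppdrag", 0), ("klass", "uppdrag", 1), ("klass", "uppdrag", None),
    ("klass", "typ", 3),
]

stats = mean_gradings(gradings)
accepted = accepted_pairs(stats)
print(accepted)
```

In this toy run only the pair klass/rang is kept: klass/uppdrag has a low mean, and klass/typ has too few gradings to be trusted yet.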

More and more interesting gradings as time goes by
[Chart: percentage of gradings (0% to 60%) for each grade 0 to 5 and 'don't know', with one series per year: 2005, 2006, 2007]
Distribution of mean gradings of word pairs
[Chart: percentage of word pairs (0% to 40%) for each mean grading from 0 to 5 in steps of 0.5, with one series per year: 2005, 2006]
Some statistics (January 2008)
2.8 M user gradings done
75 000 pairs (graded ≥ 2) in dictionary
108 000 pairs suggested by users
62 000 unique pairs suggested
20 000 of them have been accepted

Example: Synonyms to klass (class)
5: rang (grade), rank (rank), slag (kind)
4: kategori (category), stånd (social class), årskurs (grade)
3: fack (sphere), grad (degree), grupp (group), kvalitet (quality), nivå (level), sort (sort), standard (standard), stil (style)
2: skikt (layer), storleksordning (magnitude), typ (type)
1: poäng (point), stadga (stability)
0: uppdrag (mission), utbilda (educate)
How to prevent abuse?
Many gradings of a word pair are needed before it’s considered to be good
The pair to be graded is randomly picked from a very large list
Word pairs suggested by users are spell checked before they are added to the very large list
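
Two of these safeguards can be sketched briefly: picking the pair to grade at random, so that a user cannot choose which pair to influence, and checking suggested words before they enter the list. The function names, the data, and the use of a plain word list in place of a real spell checker are all assumptions made for this illustration.

```python
import random

def pick_pair_to_grade(candidates, rng=random):
    """Pick the next pair uniformly at random from the candidate list."""
    return rng.choice(candidates)

def passes_spell_check(pair, known_words):
    """Stand-in for a real spell checker: both words must be in a word list."""
    return all(word in known_words for word in pair)

# Hypothetical candidate list and word list, for illustration only.
candidates = [("klass", "rang"), ("klass", "typ"), ("klass", "nivå")]
known_words = {"klass", "rang", "typ", "nivå", "grupp"}

# User suggestions are only added if both words pass the check.
suggestions = [("klass", "grupp"), ("klass", "gruppp")]
for pair in suggestions:
    if passes_spell_check(pair, known_words):
        candidates.append(pair)

next_pair = pick_pair_to_grade(candidates, random.Random(0))
print(next_pair)
```

With a list of hundreds of thousands of pairs, a single user has a very small chance of being asked about the same pair repeatedly, which is what makes the random pick useful against abuse.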

People's definition of synonymy
Exact meaning of 'synonym' wasn’t defined
Users will grade using their intuitive understanding of the concept of synonymy and the words in the pair
The produced dictionary will use the people's own definition of synonymy
Hopefully this is exactly what they want!

Links
www.dsso.se – The large Swedish word list
www.nada.kth.se/stava – Spell checker
lexin.nada.kth.se/synlex.html – 75 000 synonyms
sv.wiktionary.org – 50 000 word dictionary
www.thesauruslex.com – Hyperlexicon
