Free construction of a Swedish dictionary of synonyms
Free Swedish Word Lists
or Hackers’ BLARK
Viggo Kann
KTH, Stockholm
GSLT meeting January 26, 2008
What is a free language resource?
Anyone can use it in an application
Anyone can study it and modify it
Anyone can take a copy of it
Anyone can improve it, release the
improvements to the public, so that
the whole community benefits
(based on four freedoms of free software,
Richard Stallman)
Strong free software culture
GNU project
FSF – Free Software Foundation
GPL – GNU General Public License
OSI – Open Source Initiative
Linux, TeX, Emacs, GCC, MySQL, PHP,
Java, Python, Firefox
First meeting of the Free Swedish
Words group at KTH January 16
11 persons from around Sweden
Lars Aronsson: Project Runeberg and
Swedish Wikipedia (Wiktionary)
Lars Törnquist and Sven Lange:
Swedish thesaurus built on Bring (1930)
Christian Mattson: Lexin dictionaries
Niklas Johansson: Spelling error
detection and correction in OpenOffice
Göran Andersson: DSSO – The large
Swedish word list
Viggo Kann: Stava, Granskatagger,
Synlex, Tvärslå Nordic dictionary
Per Starrbäck, Leif-Jöran Olsson,
Tomas Padron-McCarthy, Erik Geijer
Plans for more free words
Swedish synonyms in OpenOffice
(Niklas)
Extending DSSO with synonyms,
associations etc (Göran)
Building a free Swedish-English
dictionary (Viggo)
Testing Swedish grammar checking in
Languagetool/OpenOffice
(Viggo&Niklas)
Typical ways to construct a resource
…if you are a language technologist:
Get funding
Use resources that are free to use for researchers
Hire linguists to do the heavy jobs
…if you are a free software hacker:
Use other free resources
Collect data from lots of people using e.g. a wiki or a web form
Example: Synlex
Construct a Swedish dictionary of
synonyms as a list of synonymous pairs
I don’t want to work a lot
I don’t want to pay anyone to work
The resulting list should become free
Ideas
Automatically construct a large set of
word pairs that might be synonyms
Use tens of thousands of people, who are
each willing to make a small
contribution without payment, to check
the word pairs
More ideas
Use the Lexin on-line Swedish-English
dictionary web site, which had 9 million
(now 25 million) lookups each month
Users visit Lexin to translate words, and
are thus probably motivated to help me
Each time a user makes a lookup, give
her the opportunity to decide whether
two words are synonyms or not
My plan
1. Construct lots of possible synonyms
2. Sort out bad synonym pairs automatically
3. Ask lots of users if the rest of the pairs are good synonyms
4. Analyze the gradings done by the users and decide which pairs to keep
Step 1:
Construct lots of possible synonyms
If we have access to a Swedish-English
dictionary SE and an English-Swedish
dictionary ES, try to translate each word
to English and back again to Swedish
{(w,v): ∃y: y ∈ SE(w) ∧ v ∈ ES(y)} or
{(w,v): ∃y: y ∈ SE(w) ∧ y ∈ SE(v)}
616 000 word pairs were generated
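The two set definitions above can be sketched in a few lines of Python. This is only an illustration: the function names, the representation of SE and ES as dicts mapping a word to a set of translations, and the toy entries are assumptions, not taken from the Synlex code. The sketch also treats pairs as unordered and drops pairs of a word with itself, which the slide does not specify.

```python
from collections import defaultdict

def round_trip_pairs(se, es):
    """Pairs (w, v) with some y such that y in SE(w) and v in ES(y)."""
    pairs = set()
    for w, english in se.items():
        for y in english:
            for v in es.get(y, ()):
                if v != w:  # assumption: a word is not its own synonym
                    pairs.add(tuple(sorted((w, v))))
    return pairs

def shared_translation_pairs(se):
    """Pairs (w, v) with some y such that y in SE(w) and y in SE(v)."""
    by_english = defaultdict(set)
    for w, english in se.items():
        for y in english:
            by_english[y].add(w)
    pairs = set()
    for words in by_english.values():
        ordered = sorted(words)
        for i, w in enumerate(ordered):
            for v in ordered[i + 1:]:
                pairs.add((w, v))
    return pairs

# Toy dictionaries (hypothetical entries, for illustration only)
SE = {"klass": {"class", "grade"}, "rang": {"rank", "grade"},
      "sort": {"kind", "sort"}, "slag": {"kind"}}
ES = {"grade": {"klass", "rang", "årskurs"}, "kind": {"sort", "slag"}}

round_trip = round_trip_pairs(SE, ES)
shared = shared_translation_pairs(SE)
```

The second variant needs only one dictionary, at the cost of missing Swedish words that appear only as translations in ES.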
Step 2: Remove bad synonym pairs
automatically
Use RI (Random Indexing)
[Kanerva, Kristoferson, Holst 2000]
to measure the distance between words
represented in a large vector space
Keep pairs that have small enough
distance in the vector space
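A minimal sketch of this filtering step, assuming a word-window variant of Random Indexing and cosine distance. The slides do not say which corpus, dimensionality, context definition or threshold was used, so `DIM`, `NONZERO`, `window` and `max_distance` are hypothetical values, and all function names are made up for the example.

```python
import math
import random
from collections import defaultdict

DIM = 300      # dimensionality of the vector space (hypothetical choice)
NONZERO = 8    # number of +1/-1 entries per index vector (hypothetical choice)

def index_vector(rng):
    """Sparse random index vector, stored as {position: +1 or -1}."""
    positions = rng.sample(range(DIM), NONZERO)
    return {p: rng.choice((-1, 1)) for p in positions}

def context_vectors(sentences, window=2, seed=0):
    """Sum the index vectors of the neighbours of each word occurrence."""
    rng = random.Random(seed)
    index = {}
    for sentence in sentences:
        for word in sentence:
            if word not in index:
                index[word] = index_vector(rng)
    context = defaultdict(lambda: [0] * DIM)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    for pos, sign in index[sentence[j]].items():
                        context[word][pos] += sign
    return context

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

def filter_pairs(pairs, context, max_distance):
    """Keep only pairs whose context vectors are close enough."""
    return [(w, v) for w, v in pairs
            if w in context and v in context
            and cosine_distance(context[w], context[v]) <= max_distance]

# Toy corpus (hypothetical): 'klass' and 'grupp' occur in the same contexts
sentences = [["en", "stor", "klass"], ["en", "stor", "grupp"],
             ["hon", "vill", "utbilda"]]
vectors = context_vectors(sentences)
kept = filter_pairs([("klass", "grupp"), ("klass", "utbilda")], vectors, 0.5)
```

Words that share contexts end up with similar vectors without ever building a full word-by-word co-occurrence matrix, which is what makes the method practical for a large corpus.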
Step 3: Ask lots of users if the rest of
the pairs are good synonyms
When a user has sent a word to the Lexin
dictionary, the user receives the translation
followed by a question like:
Are 'spread' and 'lengthen' synonyms?
Answer using a scale from 0 to 5 where 0
means 'I don’t agree' and 5 means
'I do fully agree', or answer 'I don’t know'
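The question flow could look roughly like the following sketch. The function names and the in-memory `store` dict are assumptions for illustration; the slides do not describe how the Lexin site actually implements this. "I don't know" is represented as `None`.

```python
import random

SCALE = range(0, 6)  # 0 = "I don't agree" ... 5 = "I do fully agree"

def pick_pair(candidates, rng=random):
    """Pick the pair to ask about at random from the candidate list."""
    return rng.choice(candidates)

def question(pair):
    w, v = pair
    return f"Are '{w}' and '{v}' synonyms?"

def record_answer(store, pair, answer):
    """Store one grading; answer is an int 0..5, or None for 'I don't know'."""
    if answer is not None and answer not in SCALE:
        raise ValueError("grade must be 0..5 or None")
    store.setdefault(pair, []).append(answer)

store = {}
pair = pick_pair([("spread", "lengthen")])
text = question(pair)
record_answer(store, pair, 3)
record_answer(store, pair, None)
```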
Step 4: Analyzing the gradings
done by the users
1.2 million gradings were made in less
than 2 months
Grading statistics were analyzed on
several occasions
Some users sent comments
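One simple way to decide which pairs to keep is to threshold the mean grading of each pair. The sketch below assumes that "I don't know" answers (stored as `None`) are ignored and that a minimum number of gradings is required. The threshold of 2 matches the "graded ≥ 2" figure in the January 2008 statistics, while `min_count=5` and the function name are hypothetical.

```python
from statistics import mean

def pairs_to_keep(store, min_mean=2.0, min_count=5):
    """Return {pair: mean grade} for pairs with enough, high enough gradings."""
    kept = {}
    for pair, answers in store.items():
        grades = [a for a in answers if a is not None]  # drop "don't know"
        if len(grades) >= min_count and mean(grades) >= min_mean:
            kept[pair] = mean(grades)
    return kept

# Toy gradings (hypothetical numbers)
gradings = {
    ("klass", "rang"): [5, 5, 4, None, 5, 5],
    ("klass", "utbilda"): [0, 0, 1, 0, 0],
    ("klass", "slag"): [5, 5],  # too few gradings so far
}
dictionary = pairs_to_keep(gradings)
```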
More and more interesting
gradings as time goes by
[Chart: percentage (0% to 60%) per grade 0, 1, 2, 3, 4, 5 and "don't know", with series for 2005, 2006 and 2007]
Distribution of mean gradings of
word pairs
[Chart: percentage (0% to 40%) per mean grading from 0 to 5 in steps of 0.5, with series for 2005 and 2006]
Some statistics (January 2008)
2.8 M user gradings done
75 000 pairs (graded ≥ 2) in dictionary
108 000 pairs suggested by users
62 000 unique pairs suggested
20 000 of them have been accepted
Example: Synonyms to klass (class)
5: rang (grade), rank (rank), slag (kind)
4: kategori (category), stånd (social class), årskurs (grade)
3: fack (sphere), grad (degree), grupp (group), kvalitet (quality), nivå (level), sort (sort), standard (standard), stil (style)
2: skikt (layer), storleksordning (magnitude), typ (type)
1: poäng (point), stadga (stability)
0: uppdrag (mission), utbilda (educate)
How to prevent abuse?
Many gradings of a word pair are
needed before it’s considered to be
good
The pair to be graded is randomly
picked from a very large list
Word pairs suggested by users are spell
checked before they are added to the
very large list
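The check on user-suggested pairs might be sketched as below. The slide only says that suggestions are spell checked; using membership in a known word list as a stand-in for the spell checker is an assumption, as are the function name and the normalization (trimming and lowercasing).

```python
def accept_suggestion(pair, known_words, candidates):
    """Add a user-suggested pair to the candidate set if both words pass the check."""
    w, v = (s.strip().lower() for s in pair)
    # Stand-in for spell checking: both words must be in the word list
    if w == v or w not in known_words or v not in known_words:
        return False
    candidates.add(tuple(sorted((w, v))))
    return True

known = {"klass", "rang", "grupp"}  # hypothetical word list
candidates = set()
ok = accept_suggestion(("Klass", " rang"), known, candidates)
rejected = accept_suggestion(("klass", "klas"), known, candidates)
```

An accepted suggestion only enters the candidate list; it still has to collect many gradings from other users before it counts as a good synonym pair.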
People's definition of synonymy
Exact meaning of 'synonym' wasn’t
defined
Users will grade using their intuitive
understanding of the concept of
synonymy and the words in the pair
The produced dictionary will use the
people's own definition of synonymy
Hopefully this is exactly what they want!
Links
www.dsso.se  The large Swedish word list
www.nada.kth.se/stava  Spell checker
lexin.nada.kth.se/synlex.html  75 000 synonyms
sv.wiktionary.org  50 000 word dictionary
www.thesauruslex.com  Hyperlexicon