Languages are bridges … not barriers
ReferNet Technical Meeting
24-25 September 2009
Chiara Carlucci – CEDEFOP Library
What it is …
Why to use it …
How to use it …
What else …
What
Is there any place left for thesauri in this
new information retrieval environment?
What
For sure there is a place for thesauri, but they
must change in order to continue to be of
value. A true thesaurus has equivalence
relationships but it also supports other
kinds of relationship and provides
navigation assistance by means of scope
notes and other aids.
What
A thesaurus suggests other ways of
expressing an idea which is already in the
user's mind and reminds the user of related
ideas that might be valuable in searching.
What
It is useful to recall some classic principles
of indexing, because documents are
changing rapidly, because the habit of
always doing the same things leads to
repetitive, unconsidered behaviour, and
because the thesaurus is to be used as a
thesaurus!
What
It must be remembered that, though a thesaurus
appears to be made up of natural-language
terms, it is an artificial language: a controlled
vocabulary with a limited number of descriptors,
the meaning of each being understood through
the:
– context provided by the descriptors as a whole
In a bibliographic context (such as VET-Bib), the
information provided by the whole system of
descriptors is complemented by:
– the title of the document
– the abstract of the document
What
• A thesaurus is not
– a dictionary which contains definitions and
pronunciations. Unlike a dictionary, a thesaurus entry
does not define words.
– a glossary which contains explanations of concepts
relevant to a certain field of study or action.
– a lexicon because the lexicon of a language is its
vocabulary, including its words and expressions.
– a vocabulary, which is the set of words a person is
familiar with in a language. A vocabulary usually
grows and evolves with age, and serves as a useful
and fundamental tool for communication and
acquiring knowledge.
What
The thesaurus is a thesaurus
What
The thesaurus is a thesaurus
With its own hierarchical relationships, which
are used to indicate terms which are narrower
and broader in scope. A "Broader Term" (BT) is a
more general term, e.g. “Apparatus” is a
generalization of “Computers”. Reciprocally, a
Narrower Term (NT) is a more specific term, e.g.
“Digital computers” is a specialization of
“Computers”. BT and NT are reciprocals; a
broader term necessarily implies at least one
other term which is narrower. BT and NT are
used to indicate class relationships, as well as
part-whole relationships.
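The reciprocity of BT and NT can be sketched in a few lines of Python. This is only an illustration: the class name Thesaurus and its methods are invented for the example, and the three terms are the ones quoted on the slide, not an extract from a real thesaurus.

```python
from collections import defaultdict


class Thesaurus:
    """Toy thesaurus that stores only hierarchical (BT/NT) links."""

    def __init__(self):
        self.broader = defaultdict(set)   # term -> its broader terms (BT)
        self.narrower = defaultdict(set)  # term -> its narrower terms (NT)

    def add_hierarchy(self, broader_term, narrower_term):
        # Record both directions at once so BT and NT stay reciprocal.
        self.broader[narrower_term].add(broader_term)
        self.narrower[broader_term].add(narrower_term)

    def all_narrower(self, term):
        """Return every term below `term`, at any depth."""
        found = set()
        stack = [term]
        while stack:
            current = stack.pop()
            for child in self.narrower.get(current, ()):
                if child not in found:
                    found.add(child)
                    stack.append(child)
        return found


thesaurus = Thesaurus()
thesaurus.add_hierarchy("Apparatus", "Computers")
thesaurus.add_hierarchy("Computers", "Digital computers")

print(sorted(thesaurus.broader["Computers"]))
print(sorted(thesaurus.all_narrower("Apparatus")))
```

Because every link is written in both directions by the same call, a BT can never exist without its matching NT.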
What
The thesaurus is a thesaurus
With its own equivalence relationships, which
are used primarily to connect synonyms and
near-synonyms. Use (USE) and Used For (UF)
indicators are used when an authorized term is
to be used for another, unauthorized, term. The entry
for the authorized term carries the indicator "UF";
reciprocally, the entry for the unauthorized term
carries the indicator "USE". Unauthorized
terms are often called "entry vocabulary", "entry
points", "lead-in terms", or "non-preferred
terms", pointing to the authorized term (also
referred to as the Preferred Term or Descriptor)
that has been chosen to stand for the concept.
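A minimal sketch of how USE and UF references could be stored and resolved is given here. The sample terms and the function names used_for and resolve are assumptions made up for the illustration; they are not taken from ETT.

```python
# Hypothetical entry vocabulary: non-preferred term -> preferred term (USE).
USE = {
    "vocational training": "vocational education and training",
    "VET": "vocational education and training",
}


def used_for(use_map):
    """Derive the reciprocal UF lists from the USE references."""
    uf = {}
    for non_preferred, preferred in use_map.items():
        uf.setdefault(preferred, []).append(non_preferred)
    return {term: sorted(entries) for term, entries in uf.items()}


def resolve(term, use_map):
    """Return the descriptor to index or search with for `term`."""
    # A term without a USE reference is treated as already preferred.
    return use_map.get(term, term)


UF = used_for(USE)
print(resolve("VET", USE))
print(UF["vocational education and training"])
```

Deriving the UF lists from the USE references, rather than typing both by hand, keeps the two sides of the relationship from drifting apart.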
What
The thesaurus is a thesaurus
With its own associative relationships, which are
used to connect two related terms whose relationship
is neither hierarchical nor equivalent. This relationship
is described by the indicator "Related Term" (RT).
Associative relationships should be applied with
caution, since excessive use of RT will reduce
specificity in searches. Consider the following: if the
typical user is searching with term "A", would they
also want resources tagged with term "B"? If the
answer is no, then an associative relationship should
not be established.
Why
• To translate the concept you are looking for into keywords
• Multilingualism and standardisation are the main
advantages of this powerful indexing tool covering the
fields of VET
• The thesaurus is an operational tool used to retrieve
documents according to their semantic content
• The thesaurus must be made available to users to help
them identify their information needs
• The thesaurus provides a conceptual framework for
understanding reality through graphic presentations that
preserve the specificity
• It presents in an unambiguous way the conceptual
content of documents.
Why
• A thesaurus is suited to the digital environment,
where it can show its versatility
• It is open to interoperability of information,
because the thesaurus context is not only
an operating environment but also an
organizational criterion
• It can be integrated with other information
retrieval tools
Why
Searching in systems of
unstructured information
→ the web
Why
ETT is used to index and represent the content of a
document. It is mostly used by documentalists and
librarians to identify the concepts laid down in the text
and to represent them by attributing keywords from the
thesaurus. This operation makes it possible to extract the
relevant records from a collection of bibliographic
references or from a full-text documentary database to
answer the user’s query. End-users can combine ETT
descriptors in order to represent their search query.
Indexing with ETT enables all documents on the
same subject to be retrieved through a single query.
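The way descriptors attributed by the indexer turn into retrievable records can be sketched as a small inverted index. The records, descriptors and function names below are hypothetical and only illustrate the principle; they do not describe how the Cedefop database is actually implemented.

```python
from collections import defaultdict

# Hypothetical bibliographic records: record id -> descriptors
# assigned by the indexer.
RECORDS = {
    1: {"apprenticeship", "Germany"},
    2: {"apprenticeship", "France", "training policy"},
    3: {"training policy", "Germany"},
}


def build_index(records):
    """Map each descriptor to the set of records indexed with it."""
    index = defaultdict(set)
    for record_id, descriptors in records.items():
        for descriptor in descriptors:
            index[descriptor].add(record_id)
    return index


def search(index, *descriptors):
    """Return the records that carry every descriptor in the query (AND)."""
    if not descriptors:
        return set()
    hits = [index.get(d, set()) for d in descriptors]
    return set.intersection(*hits)


index = build_index(RECORDS)
print(sorted(search(index, "apprenticeship")))
print(sorted(search(index, "apprenticeship", "Germany")))
```

A single descriptor returns every record on that subject; adding a second descriptor narrows the result to the records that share both.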
Why
ETT is useful for taxonomy and semantic web
applications. The main role of a thesaurus is to
standardise the indexing process in order to
make searches simpler, more efficient and
consistent regardless of the language of the
query. It is a multilingual conceptual thesaurus
which strives to satisfy both Community and
national needs on a wide range of subjects.
Each descriptor is related to one concept in
each of the languages.
Why
Another interesting option offered by ETT is
the possibility for users to ask questions in
one language and retrieve the answers in
different languages. This is something Google
doesn’t do, or not yet!
Why
In this case the descriptor
‘transparency of qualifications’
represents a precise concept and
can retrieve many web
pages, not necessarily documents,
that contain the descriptor in its exact
form in the text
Why
In this case ‘transparency of qualifications’ is more than a descriptor:
it is a concept. We can find documents relating to the subject even if:
1. the term is not within the text
2. the document is in a different language.
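The difference between matching words and matching concepts can be sketched as follows. Everything in this example is an assumption made for illustration: the identifier "C001", the French and German labels, and the document titles are invented and are not taken from ETT or from the Cedefop catalogue.

```python
# Hypothetical concept with one label per language.
LABELS = {
    "C001": {
        "en": "transparency of qualifications",
        "fr": "transparence des qualifications",
        "de": "Transparenz der Qualifikationen",
    },
}

# Hypothetical records indexed with concept identifiers, not with words.
DOCUMENTS = [
    {"title": "Europass : bilan et perspectives", "lang": "fr", "concepts": {"C001"}},
    {"title": "Europass in Deutschland", "lang": "de", "concepts": {"C001"}},
    {"title": "Apprenticeship statistics", "lang": "en", "concepts": set()},
]


def concept_for(label):
    """Find the concept whose label, in any language, matches the query."""
    wanted = label.casefold()
    for concept_id, labels in LABELS.items():
        if any(text.casefold() == wanted for text in labels.values()):
            return concept_id
    return None


def search(label):
    """Return titles of all documents indexed with the queried concept."""
    concept_id = concept_for(label)
    if concept_id is None:
        return []
    return [d["title"] for d in DOCUMENTS if concept_id in d["concepts"]]


print(search("transparency of qualifications"))
```

The English query finds the French and German records although neither title contains the English term, because the match is made on the concept and not on the character string.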
Why
ETT is also used on the Cedefop website for
automatic categorisation or classification of
documents, and at the Library’s
reference desk to categorise users’ questions. A
simple click enables crosslingual information
access to the translation of a descriptor or of the
complete semantic chain of a descriptor. These
advanced options open the door to many
crosslingual applications, such as calculating
document similarity across languages.
How
Indexing with the updated version of ETT
… knowing how something is stored makes
finding it easier
How
Hierarchical
presentation
Alphabetical
presentation with
semantic relations
KWIC index
How
The main, word-by-word
alphabetical display is the most
familiar, since it provides a
variety of information for each
descriptor. The term’s main
entry in the alphabetical
display shows the appropriate
coordination.
This includes a scope note (SN), BT and
NT, USE and UF relations, and RT.
But be careful … this
approach is easy to
understand, but not so easy
for end-users: for example, the
fact that BT and NT mean that
two terms are related
hierarchically is obvious only
to specialists!
How
Showing hierarchical
structures to users is
a useful mechanism for
query expansion, also
because …
- users with varying levels
of domain knowledge
make use of thesauri in
different ways
- thesauri are capable of
providing end-users with
additional, useful terms
for query formulation and
expansion
How
A KWIC index is formed by
sorting and aligning the words
within an article title to allow
each word (except the stop
words) in titles to be
searchable alphabetically in
the index. It was a useful
indexing method for technical
manuals before computerized
full text search became
common. The term permuted
index is another name for a
KWIC index, referring to the
fact that it indexes all cyclic
permutations of the headings.
In this context a cyclic
permutation is a circular shift
of the heading: the words
keep their cyclic order while
a different word is brought
to the front.
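A minimal sketch of how such an index could be built is shown here. The stop-word list, the function name kwic and the two sample titles are assumptions chosen for the illustration.

```python
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def kwic(titles, stop_words=STOP_WORDS):
    """Return (keyword, rotated title) pairs sorted by keyword."""
    entries = []
    for title in titles:
        words = title.split()
        for i, word in enumerate(words):
            if word.lower() in stop_words:
                continue
            # Cyclic shift: the keyword moves to the front and the words
            # that preceded it wrap around to the end.
            rotated = words[i:] + words[:i]
            entries.append((word.lower(), " ".join(rotated)))
    return sorted(entries)


titles = ["Transparency of qualifications", "Training of trainers"]
for keyword, line in kwic(titles):
    print(f"{keyword:15} {line}")
```

Each title appears once for every significant word it contains, so a reader who remembers only "qualifications" or "trainers" still finds the full title in the alphabetical sequence.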
How
Indexing with the updated version of ETT
• 465 new descriptors = have been added to the
thesaurus since the 2008 edition, so you
cannot search earlier literature using these
descriptors
Older literature on topics represented by these
terms is searchable using related descriptors.
How
Indexing with the updated version of ETT
• 415 deleted descriptors = are no longer
used in indexing, but they may be used for
searching database entries prior to ETT’s
2008 edition
More recent literature on topics represented by
these terms is searchable using related
descriptors.
How
How can I add the new descriptors using
VET det?
1) Enter the new descriptors (p. 16-19 of the ETT
printed version) in the Notes field, preceded by
the word NEWDESCRIPTOR and separated
by commas.
e.g. Notes field: NEWDESCRIPTOR certification of learning
outcomes, key competences
– If the new descriptor is a main descriptor, put
NEWMAINDESCRIPTOR at the beginning
2) Do not enter the deleted descriptors (p.
20-22 of the ETT printed version)
How
Fundamental, basic, classic indexing
rules are really important because VET-Bib
contains 70,000 records!
Index ONLY what is in the document, and index at the
LEVEL of specificity of the document
1. Statements or assumptions are not indexed
How
Fundamental indexing rules
2. Very general descriptors are not used unless the
document covers a topic very broadly
3. The main descriptor covers the main focus or subject of a
document
4. Other descriptors indicate less important aspects within
the document
How
Fundamental indexing rules
5. ETT avoids ‘indexing up’ to a broader
descriptor when an appropriate, more
specific descriptor exists
How
Fundamental indexing rules
• Indexing is complementary to information
found in other parts of the document
(mainly title and abstract)
How
Fundamental indexing rules
• The number of descriptors should be
proportionate to the number of pages
How
Fundamental indexing rules
• “Indexable” concepts are translated into
descriptors; using the thesaurus helps
maintain consistency and prevents the
proliferation of concepts
How
Fundamental indexing rules
• Thus a single descriptor may be imprecise,
even ambiguous, while the greater the
number of descriptors used together, the
greater the precision
How
Fundamental indexing rules
• The word precision is used in a technical
sense to mean the proportion of documents
in a retrieved set that are relevant
How
Fundamental indexing rules
• The word recall is used to mean the
proportion of all relevant documents
that are actually retrieved
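In the standard information-retrieval formulation, precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that are retrieved. A short worked example, with invented record numbers, shows how the two measures are computed:

```python
def precision(retrieved, relevant):
    """Share of the retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Share of the relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


# Hypothetical search: 4 records retrieved, 5 records relevant, 3 in common.
retrieved = {1, 2, 3, 4}
relevant = {2, 3, 4, 5, 6}

print(precision(retrieved, relevant))  # 3 of 4 retrieved are relevant: 0.75
print(recall(retrieved, relevant))     # 3 of 5 relevant were retrieved: 0.6
```

The two measures pull in opposite directions: retrieving more records tends to raise recall and lower precision.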
What else …
… for the future
Permitting the searcher to switch between
navigating the thesaurus and searching
the database can only improve access. An
obvious way in which a thesaurus can be
applied directly in retrieval is to use the
relationships as a means of expanding the
search. Research, however, has shown
that these relationships must be used with
caution (precision/recall).
What else …
… for the future
In general, expanding a search to include
the narrower terms tends to improve recall
without great sacrifice in precision.
Expanding to include broader or related
terms, while it does improve recall, typically
has a significant negative impact on
precision.
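The effect can be illustrated with a toy example. The hierarchy, the index and the set of relevant records below are all invented to show the pattern described on the slide; they are not research data and prove nothing by themselves.

```python
# Hypothetical hierarchy and index, for illustration only.
NARROWER = {"training": ["apprenticeship", "continuing training"]}
BROADER = {"training": ["education"]}

INDEX = {
    "training": {1},
    "apprenticeship": {2},
    "continuing training": {3},
    "education": {4, 5, 6},
}
RELEVANT = {1, 2, 3, 4}  # records this imaginary searcher actually wants


def search(terms):
    """Return every record indexed with at least one of the terms (OR)."""
    hits = set()
    for term in terms:
        hits |= INDEX.get(term, set())
    return hits


def precision_recall(hits):
    """Return (precision, recall) of `hits` against RELEVANT."""
    if not hits:
        return 0.0, 0.0
    found = hits & RELEVANT
    return len(found) / len(hits), len(found) / len(RELEVANT)


plain = search(["training"])
with_nt = search(["training"] + NARROWER["training"])
with_bt = search(["training"] + BROADER["training"])

for name, hits in [("plain", plain), ("+NT", with_nt), ("+BT", with_bt)]:
    p, r = precision_recall(hits)
    print(f"{name:6} precision={p:.2f} recall={r:.2f}")
```

In this constructed case, adding the narrower terms raises recall from 0.25 to 0.75 with precision unchanged, while adding the broader term raises recall only to 0.50 and halves precision.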
What else …
… for the future
• How is it possible to remain positive
about the need for continued use of
thesauri?
Because only a thesaurus can become the
basis of a more extensive semantic
network that provides information not just
on what terms are used in indexing but
on how they are used within the system.