CLARIN: an introduction

CLARIN-NL
ISOcat workshop 2012
part 2 (10-10-2012)
Ineke Schuurman
Menzo Windhouwer
• Issues brought up by participants
– Which elements are to be included in ISOcat
– (CLARIN) standards, TEI etc
– Type of DC
– When to create a new DC/adapt an existing one
– When to create several DCSs
– Name of DC, several DCs with same name
– How to deal with larger amounts of data
What to include?
• ALL concepts dealing with linguistics/
metadata
– Van Dale EN-NE
include (overgankelijk werkwoord)
1) omvatten
2) (mede) opnemen
==> 'overgankelijk werkwoord' / 'transitive verb' is to
be included, same for 'overg.ww', 'trns.v.'
• One and the same DC!
What to include?
‘transitive verb’
• Several entries in ISOcat
– DC-1405
A verb which takes a direct object; that is, a verb that
expresses an action which directly affects another
person or thing.
– DC-3532
A transitive verb is a verb that takes a direct object,
and describes a relation between two participants
[Crystal 1997: 397; Payne 1997: 171]
– And several more, so... which one to select?
• When (not) to adopt an existing DC
– It should ‘match’ with the way you use a
specific notion in your annotation scheme,
application, …
– It should come with the same profile and
type
• That being said
– Reuse a CLARIN NL/VL DC when possible
(contact Ineke when such a definition is
incorrect)
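The profile and type requirement above can be applied mechanically before any definitions are read. The following is a minimal sketch under stated assumptions, not ISOcat's own API: the `DataCategory` class, the `reuse_candidates` function, the identifiers `DC-A` to `DC-C`, and the profile and type values are all hypothetical and chosen only for illustration. Whether a definition actually 'matches' your use of the notion remains a manual judgment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCategory:
    identifier: str  # hypothetical ID, not a real ISOcat entry
    name: str
    definition: str
    profile: str     # illustrative profile name
    dc_type: str     # illustrative type name


def reuse_candidates(existing, profile, dc_type):
    """Keep only DCs whose profile and type both match what the scheme needs."""
    return [dc for dc in existing
            if dc.profile == profile and dc.dc_type == dc_type]


# Made-up entries that share a name but differ in profile or type.
EXISTING = [
    DataCategory("DC-A", "transitive verb",
                 "A verb which takes a direct object.", "Morphosyntax", "simple"),
    DataCategory("DC-B", "transitive verb",
                 "A verb that takes a direct object.", "Syntax", "simple"),
    DataCategory("DC-C", "transitive verb",
                 "Placeholder definition.", "Morphosyntax", "complex"),
]

candidates = reuse_candidates(EXISTING, profile="Morphosyntax", dc_type="simple")
for dc in candidates:
    # The remaining step, comparing the definition with your own use, is manual.
    print(dc.identifier, "-", dc.definition)
```

Filtering first keeps the list of definitions to compare by hand short.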
Same name
• Not really a problem when they are good
DCs, not even when they come with the same
profile
• PositivePolarity
– In general, positive polarity refers to an
assertion that contains no marker of negation
[Crystal 1980: 299]. (DC-3405)
– the property of a word or concept to express
positive sentiment (myDC-xx)
• Whether you can reuse DC-3405 depends
on your use of the concept!
Same name
• Do not avoid reuse of a name when it is the
name commonly used!
• Another type of duplicate names where one
concept entails the other one:
– meewerkend voorwerp
– meewerkend en belanghebbend voorwerp
– event (also called 'eventuality', and including
'state')
– event (sister of 'state')
What defines a good DC?
Reusable definition
NOT
conversation (DC-2661)
Communication event with more than two
participants
mother tongue (DC-2955)
[…] a speaker’s mother tongue
What defines a good DC?
Correct definition
NOT (?)
Actor (DC-4146)
a participant in an action or process
Question: is an addressee to be
considered an actor? (used in DC-4158,
no proper definition yet)
What defines a good DC?
Meaningful definition
NOT
annotation format (DC-2562)
Specifies the annotation format that is used …
source language (DC-2494)
Indicates if a language is a source language
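Definitions like the two above add nothing beyond the name itself. A rough heuristic for spotting them is sketched below; it is only a hint for manual review, not an established ISOcat check. The function names and the `STOPWORDS` and `GENERIC` word lists are assumptions made for this illustration.

```python
import re

# Assumed word lists, chosen only for this illustration.
STOPWORDS = {"a", "an", "the", "if", "is", "that", "of", "to", "in",
             "and", "or", "which", "whether"}
GENERIC = {"indicates", "specifies", "denotes", "used"}


def informative_words(name, definition):
    """Return the words of the definition that are neither part of the name,
    nor stopwords, nor generic 'indicates/specifies' verbs."""
    name_words = set(re.findall(r"[a-z]+", name.lower()))
    ignored = name_words | STOPWORDS | GENERIC
    return [w for w in re.findall(r"[a-z]+", definition.lower())
            if w not in ignored]


def looks_tautological(name, definition):
    """Flag a definition that contributes no informative word of its own."""
    return not informative_words(name, definition)


print(looks_tautological("source language",
                         "Indicates if a language is a source language"))
print(looks_tautological("annotation format",
                         "Specifies the annotation format that is used …"))
print(looks_tautological("transitive verb",
                         "A verb which takes a direct object"))
```

A definition may legitimately repeat its name ("A transitive verb is a verb that takes a direct object"), so the sketch looks at what is left after the name is removed rather than at the mere presence of the name.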
Not-so-good examples
Mother tongue (DC-2955)
Specifies whether the language is a speaker’s mother
tongue
Mother’s language (DC-4516)
[…] NOT necessarily the mother tongue […]
- There is no definition of concept ‘mother tongue’
(Relation with /home language/ , /primary language/,
/heritage language/?)
- And why ‘speaker’?
Rule
Make your definition
• as general as possible
• as specific as necessary
Standards
• Within ISOcat there are currently few or
no standards.
Therefore
• CLARIN NL and VL will set up their own
set of ‘standardized DCs’; Ineke will be in
charge, selecting a new flag
“recommended by CLARIN NL/VL”
Standards
Another issue with respect to standards 'included' in ISOcat
- Athens Core DCs (recommended by
metadata/CMDI): we are currently adapting
them in order to avoid tautologies and/or
correct smaller ‘errors’
Target language: indicates if the language is
the target language
Conversation: […] three or more participants
The same may be necessary for TEI headers etc.
DC/DCS and profile
• Profiles are not added automatically; a
DCS may contain elements with various
profiles (although you may decide to
create several DCSs) (do select proper names!)
• In case the profile you need is not yet
available, contact Menzo and Ineke
Part B: do’s & don’ts
Do’s:
• Create a DCS for your scheme (name
project, ann.scheme, …)
• Provide clear definition (short, to the point)
for your scheme, application, ….
• Take care not to leave concepts used in your
definition undefined or vague
• Use appropriate vocabulary (per profile)
• Check ‘adopted’ DCs regularly until
standardization!
Do’s (continued)
When creating a DC, fill out
• Justification: used in XYZ, part of tagset
N
• Language section
– Always English language section
– Strong recommendation: sections for the object
language(s) and for the working language of the manual
– Sections in the various languages should
match (+/- be translations of each other)
Do’s (continued)
When creating a DC, fill out
• Example section
– Note that *negative* examples may be very
helpful! (jongens, mannen, not: gelovigen
(is a form of ADJ))
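The fill-out rules from the two 'Do's (continued)' slides can be expressed as a small checklist. This is a minimal sketch that assumes a DC draft is kept as a plain dictionary; the field names (`justification`, `object_languages`, `language_sections`, `examples`) and the function `check_dc` are hypothetical and do not reflect ISOcat's actual data model.

```python
def check_dc(dc):
    """Return (errors, warnings) for a DC draft given as a plain dict.

    Errors cover what the slides require (justification, English language
    section); warnings cover what they recommend (object-language sections,
    examples).
    """
    errors, warnings = [], []
    if not dc.get("justification", "").strip():
        errors.append("missing justification")
    sections = dc.get("language_sections", {})
    if "en" not in sections:
        errors.append("missing English language section")
    for lang in dc.get("object_languages", []):
        if lang not in sections:
            warnings.append(f"no language section for object language '{lang}'")
    if not any(sec.get("examples") for sec in sections.values()):
        warnings.append("no example section filled out")
    return errors, warnings


# Hypothetical draft: English section present, Dutch section and examples missing.
draft = {
    "justification": "used in XYZ, part of tagset N",
    "object_languages": ["nl"],
    "language_sections": {
        "en": {"definition": "A verb which takes a direct object.", "examples": []},
    },
}

errors, warnings = check_dc(draft)
print("errors:", errors)
print("warnings:", warnings)
```

Whether the sections in different languages really are translations of each other cannot be checked this way and still needs a human reader.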
Example sections
Suppose you want to illustrate a German
phenomenon:
• Ex.sec. in EN language section
– German ex with transl in English
• Ex.sec. in NL language section
– German ex with transl in Dutch
• Ex.sec. in EN linguistic section
– EN example
• Ex.sec. in NL linguistic section
– NL example with translation in English
Don’ts
• Confuse Language and Linguistic section
– Latter contains language specific values for
closed domains
• Be (too) language specific in definition
• Mention scheme in definition
• Use several definitions in one DC
• Circular definitions
• Rely on authority
• Rely on standardized status
– Definition should fit YOUR scheme, etc
Procedure - 1
Procedure - 2