CLARIN: an introduction

CLARIN-NL
ISOcat workshop 2013
part 2 (02-10-2013)
Ineke Schuurman
Menzo Windhouwer
Issues call 4
• Issues brought up by participants in call 4
– Which elements are to be included in ISOcat?
– What can be expressed in ISOcat (and other cats)?
– Names, identifiers, and the use of capitals, special characters, etc.
– Easy linking of closed and simple DCs
– Hierarchical structure between DCs
– Can a new profile be added?
Other issues
• Issues brought up in previous calls
– Type of DC
– When to create a new DC/adopt an existing one
– When to create several DCSs
– Name of DC, several DCs with same name
– How to deal with larger amounts of data
What to include?
• ALL concepts dealing with linguistics/metadata
– Van Dale EN-NE
include (overgankelijk werkwoord [transitive verb])
1) omvatten [to comprise]
2) (mede) opnemen [to take in (as well)]
• 'overgankelijk werkwoord' / 'transitive verb' is to be included; the same goes for the abbreviations 'overg.ww' and 'trns.v.'
• One and the same DC! (but separate parts)
What to include?
Have a look at ‘transitive verb’
• Several entries in ISOcat
– DC-1405
A verb which takes a direct object; that is, a verb that
expresses an action which directly affects another
person or thing.
– DC-3532
A transitive verb is a verb that takes a direct object,
and describes a relation between two participants
[Crystal 1997: 397; Payne 1997: 171]
– And several more, so... which one to select?
Names - identifiers
• Identifier
– No spaces (properNoun)
– Camel case (properNoun)
– Start with a lowercase letter (properNoun), not with a digit or punctuation character
– Digits and punctuation characters may appear elsewhere in the identifier
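The identifier rules above can be sketched as a small check. This is only an illustration and not ISOcat's own validation: the function names are made up for this example, and restricting the first character to ASCII a-z is an assumption.

```python
import re

# Assumption: the first character is an ASCII lowercase letter; after that,
# any non-whitespace character (capitals, digits, punctuation) is allowed.
_IDENTIFIER = re.compile(r"[a-z]\S*")


def is_valid_identifier(identifier: str) -> bool:
    """Return True if the identifier follows the rules listed on the slide."""
    return _IDENTIFIER.fullmatch(identifier) is not None


def name_to_identifier(name: str) -> str:
    """Turn a (multi-word) name into a camelCase identifier candidate."""
    words = name.split()
    if not words:
        raise ValueError("name must contain at least one word")
    first, rest = words[0], words[1:]
    head = first[0].lower() + first[1:]
    tail = "".join(word[0].upper() + word[1:] for word in rest)
    return head + tail


if __name__ == "__main__":
    for candidate in ("properNoun", "proper noun", "X-qatalClause", "1stPerson"):
        print(candidate, is_valid_identifier(candidate))
    print(name_to_identifier("proper noun"))
```

The result of name_to_identifier is only a candidate: a name that starts with a digit still needs manual attention.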
Names - identifiers
• Name
– Multi-word units allowed (proper noun)
– Several names allowed (in the same name section), one per entry
– Use the most common name first, alternative names in further entries (same language)
– Use common spelling
– Abbreviations etc. go in ‘Data Element Name’
Same name
• Not really a problem, not even when the DCs come with the same profile
• PositivePolarity
– In general, positive polarity refers to an
assertion that contains no marker of negation
[Crystal 1980: 299]. (DC-3405)
– the property of a word or concept to express
positive sentiment (myDC-xx)
• Whether you can reuse DC-3405 depends
on your use of the concept!
Same name
• Do not avoid reuse of a name when it is
the name commonly used!
• Another type of duplicate names where one
concept entails the other one:
– meewerkend voorwerp (indirect object)
– meewerkend en belanghebbend voorwerp (indirect and benefactive object)
– event (also called 'eventuality', and including
'state')
– event (sister of 'state')
Identical identifiers
• Identical identifiers will be accepted by the system!
– There are at least 4 DCs with the identifier ‘noun’
• Rule: start with a lowercase letter
• In that respect
– X-qatalClause should become x-qatalClause, even when the latter exists as well
• The difference is mainly to be made in
– Name
– Definition
• Identifier ≠ PID (persistent identifier); PIDs are unique (http://www.isocat.org/datcat/DC-1345)
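As a rough illustration of the difference between identifiers and PIDs, the sketch below groups a few records by identifier and reports which identifiers are shared. The records are modelled on the PositivePolarity slide; the identifier spelling "positivePolarity" and the record layout are assumptions, and "myDC-xx" is the slide's own placeholder, not a real PID.

```python
from collections import defaultdict

# Illustrative records only. "myDC-xx" is the placeholder used on the slide,
# and the identifier spelling "positivePolarity" is assumed.
RECORDS = [
    {"pid": "DC-3405", "identifier": "positivePolarity",
     "definition": "an assertion that contains no marker of negation"},
    {"pid": "myDC-xx", "identifier": "positivePolarity",
     "definition": "the property of a word or concept to express positive sentiment"},
]


def shared_identifiers(records):
    """Map each identifier used by more than one DC to the PIDs that use it."""
    pids_by_identifier = defaultdict(list)
    for record in records:
        pids_by_identifier[record["identifier"]].append(record["pid"])
    return {ident: pids for ident, pids in pids_by_identifier.items()
            if len(pids) > 1}


def pids_are_unique(records):
    """Return True if no PID occurs twice in the given records."""
    pids = [record["pid"] for record in records]
    return len(pids) == len(set(pids))


if __name__ == "__main__":
    print(shared_identifiers(RECORDS))
    print(pids_are_unique(RECORDS))
```

Since the identifier cannot tell the two DCs apart, the name and definition (and ultimately the PID) have to do that work.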
Adoption
• When (not) to adopt an existing DC
– It should ‘match’ the way you use a specific notion in your annotation scheme, application, …
– It should come with the same profile and type
• That being said
– Reuse a CLARIN NL/VL DC whenever
possible (contact Ineke when such a
definition is incorrect)
What defines a good DC?
• Correct definition
• NOT (unless all concepts used are defined in ISOcat):
– Actor (DC-4146): a participant in an action or process
– Question: is an addressee to be considered an actor? (used in DC-4158, no proper definition yet)
What defines a good DC?
• Reusable and correct definition
• NOT:
– conversation (DC-2661): Communication event with more than two participants
– mother tongue (DC-2955): […] a speaker’s mother tongue
– neuter (myDC-XX): In CGN the gender … / In Dutch …
What defines a good DC?
• Meaningful definition
• NOT:
– annotation format (DC-2562): Specifies the annotation format that is used …
– source language (DC-2494): Indicates if a language is a source language
– mother tongue (DC-2955): […] a speaker’s mother tongue
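The three definitions above share one problem: each repeats the name of the DC it is supposed to define. A crude way to spot this is sketched below. It is a heuristic with a made-up function name, not an ISOcat feature, and it only catches literal repetition of the name.

```python
import re


def repeats_own_name(name: str, definition: str) -> bool:
    """Return True if the definition contains the DC's name as a whole phrase."""
    pattern = r"\b" + re.escape(name.lower()) + r"\b"
    return re.search(pattern, definition.lower()) is not None


if __name__ == "__main__":
    # Names and definition fragments as quoted on the slides.
    examples = [
        ("annotation format", "Specifies the annotation format that is used"),
        ("source language", "Indicates if a language is a source language"),
        ("actor", "a participant in an action or process"),
    ]
    for name, definition in examples:
        print(name, repeats_own_name(name, definition))
```

A definition that passes this check can still be incorrect or not reusable, so it complements a manual review rather than replacing it.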
Not-so-good examples
• Mother tongue (DC-2955): Specifies whether the language is a speaker’s mother tongue
• Mother’s language (DC-4516): […] NOT necessarily the mother tongue […]
– There is no definition of the concept ‘mother tongue’ (relation with /home language/, /primary language/, /heritage language/? And what about ‘father tongue’?)
– And why ‘speaker’?
Rule
Make your definition
• as general as possible
• as specific as necessary
Linking closed - simple
• Not always that simple when there are many entries within a profile
– The selected profile determines the number of choices
– You can order them: name, pid, owner, …
• When you don’t find the ‘simple’ you need, have a look in other profiles! (esp. ‘undecided’)
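The search strategy described above (look in the selected profile first, then in ‘undecided’, then elsewhere, and order the hits) could be sketched as follows. Everything in this example is hypothetical: the SimpleDC record, the function name and the fallback order are assumptions for illustration, not part of the ISOcat interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleDC:
    """Hypothetical, minimal record for a simple data category."""
    pid: str
    name: str
    owner: str
    profile: str


def find_simple(dcs, name, profile, sort_key="name"):
    """Find simple DCs by name, preferring the given profile.

    Falls back to the 'undecided' profile and then to any other profile,
    and orders the hits by name, pid or owner.
    """
    if sort_key not in {"name", "pid", "owner"}:
        raise ValueError("sort_key must be 'name', 'pid' or 'owner'")
    by_name = [dc for dc in dcs if dc.name == name]
    in_profile = [dc for dc in by_name if dc.profile == profile]
    undecided = [dc for dc in by_name if dc.profile == "undecided"]
    # Use the first non-empty group: selected profile, 'undecided', anything.
    hits = in_profile or undecided or by_name
    return sorted(hits, key=lambda dc: getattr(dc, sort_key))


if __name__ == "__main__":
    # Placeholder PIDs, owners and profile names, not real ISOcat entries.
    catalogue = [
        SimpleDC("DC-b", "noun", "owner2", "undecided"),
        SimpleDC("DC-a", "noun", "owner1", "Morphosyntax"),
        SimpleDC("DC-c", "verb", "owner1", "undecided"),
    ]
    print(find_simple(catalogue, "noun", "Morphosyntax"))
    print(find_simple(catalogue, "verb", "Morphosyntax"))
```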
Standards
• Within ISOcat there are currently few or no standards
Therefore
• CLARIN NL and VL will set up their own set of ‘standardized DCs’. Ineke will be in charge and will mark them with the new flag “recommended by CLARIN NL/VL” (issue: often no correct profile has been selected, so the ‘undecided’ one is still shown)
Standards
Another issue with regard to standards 'included' in ISOcat
– Athens Core DCs (recommended by metadata/CMDI): we are to adapt them in order to avoid tautologies and/or correct smaller ‘errors’
Target language: indicates if the language is the target language
Conversation: […] three or more participants
The same may be necessary for TEI Headers etc.
Contact Ineke in case you are not sure whether you can reuse such DCs
DC/DCS and profile
• Profiles are not added automatically. A DCS may contain elements with various profiles, although you may decide to create several DCSs (do select proper names!)
• In case the profile you need is not yet available, contact Menzo and Ineke
Part B: do’s & don’ts
Do’s:
• Create a DCS for your scheme (named after the project, annotation scheme, …); it is to contain all your DCs (including adopted ones)
• Provide a clear definition (short, to the point) that fits your scheme, application, …
• Take care not to leave concepts used in your definition undefined or vague
• Use appropriate vocabulary (per profile)
• Check ‘adopted’ DCs regularly until they are standardized/recommended (history button!)
Do’s (continued)
When creating a DC, fill out
• Justification: used in XYZ, part of tagset N
• Language section
– Always English language section (advice: do not
create sections in other languages (NL!) before
Ineke has seen your input)
– Strong recommendation: sections for the object language(s) and for the working language of the manual
– Sections in the various languages should match (be more or less translations of each other)
Do’s (continued)
When creating a DC, fill out
• Example section
– Note that *negative* examples may be very helpful! (for nouns (CGN): jongens, mannen; not: gelovigen (is a form of ADJ))
Example sections
Suppose you want to illustrate a German
phenomenon:
• Example section in EN language section
– German example with translation in English
• Example section in NL language section
– German example with translation in Dutch
• Example section in EN linguistic section
– EN example
• Example section in NL linguistic section
– NL example with translation in English
Don’ts
• Confuse Language and Linguistic section
– The latter contains language-specific values for closed domains
• Be (too) language specific in definition
• Mention scheme in definition
• Use several definitions in one DC
• Circular definitions
• Rely on authority
• Rely on standardized status
– Definition should fit YOUR scheme, etc.
Procedure - 1
Procedure - 2