Annotation guidelines
corpora annotated for multiword expressions
Words and tokens
While the definition of an MWE inherently relies on the notion of a word, manual annotation and automatic identification of verbal MWEs (VMWEs) in our task are performed on automatically tokenized texts. It is therefore important to understand the distinction between words and tokens in the context of VMWEs.
A word is a linguistically (notably semantically) motivated unit. The detection of words is, thus, language-dependent and annotation experts should have a clear idea of how to define it for their own language (even if this definition proves hard in general).
A token is a technical and pragmatic notion, defined according to more or less linguistically motivated clues and dependent on the particular tokenization tool at hand. Note that the term token is ambiguous in NLP: it can also mean an individual occurrence of a certain linguistic unit, as opposed to a type, i.e. the set of all surface realisations of a unit. In these guidelines, we refrain from using this second sense.
Tokens should ideally be as close as possible to words. In practice, however, because (automatic) tokenization is a hard task, the relation between tokens and words is not always 1-to-1. The following cases occur:
- A token coincides with a word:
  καλός kalos beautiful
  περί peri about
- Several tokens build up one word, as in abbreviations, possessive markers, words with "accidental" separators, inflected or derived forms of foreign names, etc. In this case we speak of a multitoken word (MTW). The pipe symbol '|' indicates token separation in these examples:
  год|. year
  Wie geht|'|s How goes it How are you
  υπΔρ υποψήφιος διδάκτορας PhD candidate
  pp|. pages
  Pandora|'|s Pandora's
  a|/|f|. a favor in favor
  Rte|. remitente sender
  SMS|-|ować to write an SMS
  d|-|voastră polite "you"
  str|. pages
  le|-|to
  tweet|-|овање tweet|-|ovanje to write tweets
- One token can contain several words, as in contractions and compounds. In this case we speak of a multiword token (MWT). Identifying MWTs is important because they are potential candidates for VMWEs. However, deciding what counts as a word and what counts as an MWT is a hard, language-specific question, and language-specific MWT tests are being designed to this end. See also the representation of MWTs in Universal Dependencies. The precise word forms cannot always be straightforwardly deduced from the MWT containing them and vice versa, as in don't, della, du, etc. Examples of MWTs include:
  Apfelbaum = Apfel+Baum apple tree
  στον = σε+τον
  al = a+el to+the to the
  compárese = compare+se compare SE_PARTICLE be it compared
  suicidarse = suicidar+se suicide SELF to commit suicide
  jarleku = jar(ri)+leku sit+place seat
  b'fhearr = ba+fhearr be.COND better prefer
  appelboom = appel+boom apple tree
  pannenkoek = pan+koek pancake
  robiłem = robi+łem do.3.SG.PRES+be.1.SG.PAST.AGL I did
  żeśmy = że+śmy that+be.1.PL.AGL that-we
  новосадски = ново+садски novosadski = novo+sadski of Novi Sad (an adjective derived from a city name)
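The three token-word relations described above can be sketched in a few lines of Python. This is only an illustration, not part of the annotation tooling: the names `Unit`, `from_pipe`, `from_plus` and `classify` are hypothetical, and the two parsers simply read the notations used in the examples ('|' for token boundaries inside one word, '=' and '+' for the words inside one token).

```python
from dataclasses import dataclass


@dataclass
class Unit:
    tokens: list[str]  # what the tokenizer produced
    words: list[str]   # the linguistically motivated units


def from_pipe(notation: str) -> Unit:
    """Read an example such as "Pandora|'|s": one word, possibly several tokens."""
    tokens = notation.split("|")
    return Unit(tokens=tokens, words=["".join(tokens)])


def from_plus(notation: str) -> Unit:
    """Read an example such as "al = a+el": one token containing several words."""
    token, _, parts = notation.partition("=")
    words = [w.strip() for w in parts.split("+")]
    return Unit(tokens=[token.strip()], words=words)


def classify(unit: Unit) -> str:
    """Name the token-word relation of a single unit."""
    n_tokens, n_words = len(unit.tokens), len(unit.words)
    if n_tokens == 1 and n_words == 1:
        return "token = word"
    if n_tokens > 1 and n_words == 1:
        return "MTW"  # multitoken word
    if n_tokens == 1 and n_words > 1:
        return "MWT"  # multiword token
    return "mixed"


for unit in (from_pipe("καλός"), from_pipe("Pandora|'|s"), from_plus("al = a+el")):
    print(unit.tokens, unit.words, classify(unit))
```

The sketch deliberately takes the word segmentation as given: deciding where the word boundaries are remains the language-specific question discussed above.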
While a VMWE always contains at least two words, the relation between VMWEs and tokens can be twofold:
- A VMWE contains several tokens, whether each of them coincides with a word or not:
  прочитам от корица до корица to read from cover to cover (5 words, 5 tokens)
  wie geht's how goes it how are you (2 words, 4 tokens)
  παίζω στα δάχτυλα pezo sta dachtyla play in-the fingers to know very well (3 words, 4 tokens)
  to open Pandora's box (3 words, possibly 5 tokens)
  dar por sentado to give for seated to take for granted (3 words, 3 tokens)
  irse de rositas to go_self of little_roses to get off scot free (3 words, 4 tokens)
  cavalcare l'onda ride the wave (3 words, 4 tokens)
  robił|em z igły widły made.3.SG.M1+be.1.SG.AGL a pitchfork out of a needle I made a mountain out of a molehill (4 words, 5 tokens)
  cair de pára-quedas to fall with parachute to arrive unprepared in the middle of a situation (3 words, possibly 5 tokens). Note: according to the new orthography rules, this word is written 'paraquedas'; the old spelling may still be found in annotated texts.
  queixar-se-ia complain-self-would would complain (2 words, possibly 5 tokens)
  vreči puško v koruzo throw a rifle in the corn to give up (4 words, 4 tokens)
  hedh një sy throw an eye take a look (3 words, 3 tokens)
  причати на|памет pričati na|pamet to talk by heart to talk not relying on facts (3 words, 2 tokens)
- A VMWE contains one (multiword) token:
  anfangen at-catch to begin
  aanvangen at-catch to begin
Note finally that multitoken words are not considered verbal MWEs since they contain one (multitoken) word only.
Whenever the distinction between a word and a token is judged by a particular language team as hard to tackle, a possible option is to consider these two notions equivalent for the needs of this shared task.
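The requirement that a VMWE contain at least two words, whatever its number of tokens, can be expressed as a simple check. The following Python sketch is illustrative only: the `Unit` alias and the functions `count_tokens`, `count_words` and `may_be_vmwe` are hypothetical names, and the segmentations of the sample expressions (in particular an+fangen) are assumptions made for the sake of the example.

```python
# Each unit pairs the tokens produced by the tokenizer with the words
# they realise: (tokens, words).
Unit = tuple[list[str], list[str]]


def count_tokens(units: list[Unit]) -> int:
    return sum(len(tokens) for tokens, _ in units)


def count_words(units: list[Unit]) -> int:
    return sum(len(words) for _, words in units)


def may_be_vmwe(units: list[Unit]) -> bool:
    """Necessary, not sufficient, condition: at least two words."""
    return count_words(units) >= 2


# One multiword token (assumed segmentation: an + fangen).
anfangen = [(["anfangen"], ["an", "fangen"])]
# One multitoken word: three tokens, but a single word.
pandoras = [(["Pandora", "'", "s"], ["Pandora's"])]
# Three tokens, each coinciding with a word.
dar_por_sentado = [(["dar"], ["dar"]), (["por"], ["por"]), (["sentado"], ["sentado"])]

for name, units in [("anfangen", anfangen), ("Pandora's", pandoras),
                    ("dar por sentado", dar_por_sentado)]:
    print(name, count_words(units), count_tokens(units), may_be_vmwe(units))
```

If a language team decides to treat words and tokens as equivalent, every unit has exactly one token and one word, and the check reduces to counting tokens.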