Annotation guidelines
PARSEME corpora annotated for multiword expressions


Words and tokens

While the definition of an MWE inherently relies on the notion of a word, manual annotation and automatic identification of VMWEs in our task are performed on texts which are automatically tokenized. It is therefore important to understand the distinction between words and tokens in the context of VMWEs.

A word is a linguistically (notably semantically) motivated unit. The detection of words is, thus, language-dependent and annotation experts should have a clear idea of how to define it for their own language (even if this definition proves hard in general).

A token is a technical and pragmatic notion, defined according to more or less linguistically motivated clues and depending on the particular tokenization tool at hand. Note that the notion of a token is ambiguous in NLP. It can also mean an individual occurrence of a certain linguistic unit, as opposed to a type, i.e. the set of all surface realisations of a unit. In these guidelines, we refrain from using this second sense.

Tokens should ideally be as close as possible to words. However, in practice, because (automatic) tokenization is a hard task, the relation between tokens and words is not always 1-to-1. The following cases occur:

  • A token coincides with a word:
    • مدهشة surprising, نزهة walk, ب with, قام to do
    • вземам, решение, наяве, бял, на, се, д-р
    • mít, hlad, se, úžas
    • einen, Spaziergang, machen, Überraschung
    • κάνω, μία, βόλτα, έξυπνος kano, mia, volta, exipnos make, a, walk, clever
    • take, a, walk, astonishment
    • dar, un, paseo, sorpresa, maldecir, bienvivir
    • ibilaldi, bat, egin, ezuste
    • faire, une, promenade, étonnement
    • tóg, siúl, ionadh
    • δίδωμι didōmi give give
      καλός kalos beautiful beautiful
      περί peri about about
    • napraviti, jedan, šetnja, začuđenost
    • tesz, egy, séta, meglepetés
    • mengambil, sebuah, berjalan, heran
    • fare, una, passeggiata, sorpresa
    • 取る, その, 歩く, 驚き
    • ferħ, libes, sabiħ
    • een, wandeling, maken, verrassing
    • robić to do, na on, dokładność precision
    • comer eat, uma a, guarda-chuva umbrella, antessala anteroom
    • face, o, plimbare
    • iti, na, en, sprehod, začudenost
    • седети sedeti to sit, скрштених skrštenih crossed, руку ruku hands
    • gå, på, promenad, förvåning
    • 采取, 一个, 步行, 惊愕
  • Several tokens build up one word, as in abbreviations, possessive markers, words with "accidental" separators, inflected or derived forms of foreign names, etc. In this case we speak of a multitoken word (MTW). The pipe symbol '|' indicates token separation in these examples:
    • قرار|ها| decision-her her decision
    • т|.|н|. etc.
      год|. year
    • z|.|B|. for instance
      Wie geht|'|s How goes it How are you
    • κ. κύριος Mister
      υπΔρ υποψήφιος διδάκτορας PhD candidate
    • M|. Mister
      pp|. pages
      Pandora|'|s
    • A|/|A|. a la atención de for the attention of
      a|/|f|. a favor in favor
      Rte|. remitente sender
    • etab|. eta abar and so on
    • می|-|روم، آیت|-|الله، کتاب|-|ها
    • aujourd|'|hui today
    • οἷον τ'εἰμί hoion t'eimi of.what.sort and be.1SG I am able to
    • danas today
    • időjárás|-|jelentés weather forecast
    • vice|-|presidente vice-president
    • libs|et she wore
    • a|.|u|.|b|.| please
      Pandora|'|s Pandora's
    • Chomsky|'|ego of Chomsky
      SMS|-|ować to write an SMS
    • vice|-|presidente vice-president
    • prim|-|ministru prime minister
      d|-|voastră polite "you"
    • g|. Mister
      str|. pages
      le|-|to
    • FIFA|-|у FIFA|-|u FIFA.ACC
      tweet|-|овање tweet|-|ovanje to write tweets
    • EU|:|s EU's
  • One token can contain several words, as in contractions and compounds. In this case we speak of a multiword token (MWT). Identifying MWTs is important because they are potential candidates for VMWEs. However, defining what a word and an MWT are is a hard, language-specific question, and language-specific MWT tests are being designed to this end. See also the representation of MWTs in Universal Dependencies. The precise word forms cannot always be straightforwardly deduced from the MWT containing them and vice versa, as in don't, della, du, etc. Examples of MWTs include:
    • وسيكتبوناها = و + س + يكتبون + ها and they are going to write it they are going to write it
    • вагон-ресторант train carriage+restaurant train buffet
    • Schulaufgabe = Schule+Aufgabe school+exercise homework
      Apfelbaum = Apfel+Baum apple+tree apple tree
    • στου = σε+του stu = se + tu to-the
      στον = σε+τον
    • don't = do+not
    • del = de+el of the from/of the
      al = a+el to+the to the
      compárese = compare+se compare SE_PARTICLE be it compared
      suicidarse = suicidar+se suicide SELF to commit suicide
    • sudurluze = sudur+luze nose+long long-nosed
      jarleku = jar(ri)+leku sit+place seat
    • کتابش=کتاب+ش
    • du = de+le from the
    • sa = i+an in the
      b'fhearr = ba+fhearr be.COND better prefer
    • καίτοι = καί + τοι kaitoi = kai + toi and indeed and indeed
    • uzbrdo = uz+brdo uphill
    • della = di+la of the
    • huiswerk = huis+werk home+work homework
      appelboom = appel+boom apple+tree apple tree
      pannenkoek = pan + koek pancake
    • Białymstoku=Białym+stoku white+slope Białystok.INST (a city name)
      robiłem=robi+łem do.3.SG.PRES+be.1.SG.PAST.AGL I did
      żeśmy = że+śmy that+be.1.PL.AGL that-we
    • neles = em+eles on them
    • într-o = într-+o in a
    • nanj = na+njega on him
    • напоље = на + поље napolje = na + polje outside
      новосадски = ново + садски novosadski = novo + sadski Novi Sad (an adjective from a city name)
    • arvsmassa = arv+massa genetic stock
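
The three cases above can be sketched in code. The following Python snippet is only an illustration and not part of any PARSEME tool. It assumes a hypothetical notation in which '|' separates the tokens of one word (as in the MTW examples) and '+' separates the words contained in one token (as in the MWT decompositions). The function name classify_unit is invented for this sketch.

```python
def classify_unit(unit: str) -> str:
    """Classify one whitespace-free unit written in the assumed notation.

    Assumption (not from the guidelines): '|' marks token boundaries inside
    a single word, '+' marks word boundaries inside a single token.
    """
    has_token_split = "|" in unit
    has_word_split = "+" in unit
    if has_token_split and has_word_split:
        # The sketch does not model units that mix both phenomena.
        raise ValueError(f"mixed notation is not supported: {unit!r}")
    if has_token_split:
        return "MTW"          # several tokens build up one word
    if has_word_split:
        return "MWT"          # one token contains several words
    return "one-to-one"       # the token coincides with the word


for example in ["walk", "aujourd|'|hui", "Pandora|'|s", "de+el", "do+not"]:
    print(example, "->", classify_unit(example))
```

Note that the '+' form is an annotation of the underlying words rather than a surface form: as stated above, the words cannot always be deduced from the MWT itself (do+not for don't, de+el for del).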

While a VMWE always contains at least two words, the relation between VMWEs and tokens can be twofold:

  • A VMWE contains several tokens, whether each of them coincides with a word or not:
    • نزهة ب قام make with walk make a walk (2 words, 2 tokens)
    • вземам решение make a decision (2 words, 2 tokens)
      прочитам от корица до корица to read from cover to cover (5 words, 5 tokens)
    • eine Rede halten (2 words, 2 tokens) a speech hold to give a speech
      wie geht's (2 words, 4 tokens) how goes it how are you
    • παίρνω μία απόφαση perno mia apofasi take a decision to decide (2 words, 2 tokens)
      παίζω στα δάχτυλα pezo sta dachtyla play in-the fingers to know very well (3 words, 4 tokens)
    • to take a walk (2 words, 2 tokens)
      to open Pandora's box (3 words, possibly 5 tokens)
    • dar un paseo (2 words, 2 tokens) to give a walk to take a walk
      dar por sentado (3 words, 3 tokens) to give for seated to take for granted
      irse de rositas (3 words, 4 tokens) to go_self of little_roses to get off scot free
    • ibilaldia egin (2 words, 2 tokens)
    • دستور داد (2 words, 2 tokens)
    • b'fhearr liom (2 words, 4 tokens) I would prefer
    • τοὺς λόγους ποιέομαι tous logous poieomai the word do.1SG to speak
    • dignuti ruke to raise hands to give up (2 words, 2 tokens)
      otvoriti Pandorinu kutiju open Pandora's box to face with problems (3 words, 3 tokens)
    • sétát tesz to take a walk (2 words, 2 tokens)
    • tenere un discorso (2 words, 2 tokens) hold a speech to give a speech
      cavalcare l'onda (3 words, 4 tokens) ride the wave ride the wave
    • kien idur fuq il-fatt turns on the fact
    • een wandeling maken (2 words, 2 tokens) a walk make to take a walk
    • robi z igły widły make.3.SG a pitchfork out of a needle he makes a mountain out of a molehill (4 words, 4 tokens)
      robił|em z igły widły made.3.SG.M1+be.1.SG.AGL a pitchfork out of a needle I made a mountain out of a molehill (4 words, 5 tokens)
    • dar uma caminhada to give a walk (2 words, 2 tokens)
      cair de pára-quedas to fall with parachute to arrive unprepared in the middle of a situation (3 words, possibly 5 tokens). Note: according to the new orthography rules, this word would be written 'paraquedas'; the old spelling may still be found in annotated texts.
      queixar-se-ia complain-self-would would complain (2 words, possibly 5 tokens)
    • a da ortul popii to die (3 words, 3 tokens)
    • klicati jelene to call deer to vomit (2 words, 2 tokens)
      vreči puško v koruzo throw a rifle in the corn to give up (4 words, 4 tokens)
    • данути душом danuti dušom to breathe soul to feel relieved (2 words, 2 tokens)
      причати на|памет pričati na|pamet to talk by heart to talk not relying on facts (3 words, 2 tokens)
    • hålla ett tal (2 words, 2 tokens) hold a speech to give a speech
    • 一 个 决定 (2 words, 2 tokens) do one CL decision to make a decision
  • A VMWE contains one (multiword) token:
    • no example found for Arabic
    • no example found for Bulgarian
    • vorbereiten to pre-arrange to prepare
      anfangen at-catch to begin
    • έδωσα-πήρα gave-1SG took-1SG I tried hard
    • to pretty-print
    • suicidarse suicide_self to commit suicide
    • n.a.
    • court-circuiter to short circuit
    • προσ-άγω pros-agō towards lead.1SG to lead towards
    • pripremiti to pre-arrange to prepare
    • kinyír out.cut to kill
    • corto-circuitare to short circuit
      suicidarsi suicide_self to commit suicide
    • voorbereiden to pre-arrange to prepare
      aanvangen at-catch to begin
    • no example found for Polish
    • queixar-se-ia complain-SELF-would would complain
    • a se-ndura RCLI.ACC-have.the.heart to have the heart
    • pripraviti to pre-arrange to prepare
    • no example found for Serbian
    • klargöra clear-make clarify
      påpeka on-point point out
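
The word and token counts given in the examples above can be reproduced with a small sketch. It is an illustration under stated assumptions, not an official procedure: whitespace is assumed to separate units, '|' is assumed to split one word into several tokens, and '+' is assumed to split one token into several words. The function names count_words_and_tokens and vmwe_token_relation are hypothetical.

```python
from typing import Tuple


def count_words_and_tokens(expression: str) -> Tuple[int, int]:
    """Return (number of words, number of tokens) for an expression.

    Assumed notation: units are separated by whitespace; inside a unit,
    '|' separates tokens of one word and '+' separates words of one token.
    """
    n_words = 0
    n_tokens = 0
    for unit in expression.split():
        if "|" in unit and "+" in unit:
            raise ValueError(f"mixed notation is not supported: {unit!r}")
        if "|" in unit:
            # Multitoken word: one word, several tokens (empty pieces ignored).
            n_words += 1
            n_tokens += len([t for t in unit.split("|") if t])
        elif "+" in unit:
            # Multiword token: several words, one token.
            n_words += len([w for w in unit.split("+") if w])
            n_tokens += 1
        else:
            n_words += 1
            n_tokens += 1
    return n_words, n_tokens


def vmwe_token_relation(expression: str) -> str:
    """Describe how a candidate relates to tokens, per the two cases above."""
    n_words, n_tokens = count_words_and_tokens(expression)
    if n_words < 2:
        # A VMWE always contains at least two words.
        return "not a VMWE candidate"
    return "several tokens" if n_tokens > 1 else "one multiword token"


print(count_words_and_tokens("wie geht|'|s"))          # (2, 4)
print(count_words_and_tokens("open Pandora|'|s box"))  # (3, 5)
print(vmwe_token_relation("vor+bereiten"))             # one multiword token
print(vmwe_token_relation("SMS|-|ować"))               # not a VMWE candidate
```

The counts only reflect how the expression was written in the assumed notation. Actual token counts depend on the tokenizer used, which is why some examples above say "possibly 5 tokens".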

Note finally that multitoken words are not considered verbal MWEs since they contain one (multitoken) word only:

  • no example found for Bulgarian
  • ??
  • αερολογώ aerologo air+talk to talk aimlessly
  • n.a.
  • odolustu blood+empty to bleed
  • λογοποιέομαι logopoieomai word-do to speak
  • SMS-ati to write an SMS
  • anteporre to put + in front of
  • SMS-ować to write an SMS
  • pós-datar to post-date
  • a binedispune well-dispose to cheer up
  • SMS-jati to write an SMS
  • SMS-овати SMS-ovati to write an SMS

Whenever the distinction between a word and a token is judged by a particular language team as hard to tackle, a possible option is to consider these two notions equivalent for the needs of this shared task.




PARSEME corpora annotation guidelines version 1.3.6 stable version, last updated on September 20, 2022