Annotation guidelines (version 2.0; UNDER CONSTRUCTION)
Used by the PARSEME corpora annotated for multiword expressions


Multiword expressions

A multiword expression (MWE) is a (continuous or discontinuous) sequence of words with the following compulsory properties:

  • It contains at least two component words which are lexicalised, i.e. always realized by the same lexemes. Only these lexicalized components are annotated. For instance in he paid several important visits to the president, we annotate only the components highlighted in bold.
  • Its neutral form forms a weakly connected graph, i.e., in its dependency graph, every (lexicalized) component is reachable from every other component if the directions of the dependencies are disregarded. For instance, in the following MWE [figure: Non-neutral form] the highlighted components do not form a weakly connected graph, but this form is not a neutral one. When it is transformed to a neutral form [figure: Non-neutral form] the connectivity condition is fulfilled.
  • It shows some degree of orthographic, morphological, syntactic and/or semantic idiosyncrasy with respect to what is considered the general grammar rules of a language. This condition is tested by the decision diagrams documented in sections 5 to 9. Collocations, i.e. word co-occurrences whose idiosyncrasy is of a statistical nature only (e.g. the graphic shows, drastically drop), are not considered MWEs.
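
The connectivity condition in the second property can be illustrated with a minimal sketch. It is not part of the PARSEME tooling. It assumes one reading of the definition: only dependency arcs between two lexicalized components count, so the subgraph induced by the lexicalized components must be weakly connected. The function name `is_weakly_connected`, the token numbering and the head assignments are all hypothetical.

```python
from collections import deque


def is_weakly_connected(lexicalized, heads):
    """Check whether the lexicalized components form a weakly connected graph.

    lexicalized: set of token ids (hypothetical 1-based positions).
    heads: dict mapping each token id to the id of its head (0 for the root).
    Assumption: only arcs whose two ends are both lexicalized are considered.
    """
    nodes = set(lexicalized)
    if not nodes:
        return False
    # Build undirected adjacency, i.e. disregard the direction of dependencies.
    adjacency = {node: set() for node in nodes}
    for dependent, head in heads.items():
        if dependent in nodes and head in nodes:
            adjacency[dependent].add(head)
            adjacency[head].add(dependent)
    # Breadth-first search from an arbitrary lexicalized component.
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen == nodes


# "he paid several important visits to the president", with assumed heads:
# 1 he, 2 paid, 3 several, 4 important, 5 visits, 6 to, 7 the, 8 president
neutral_heads = {1: 2, 2: 0, 3: 5, 4: 5, 5: 2, 6: 8, 7: 8, 8: 2}
print(is_weakly_connected({2, 5}, neutral_heads))  # True: visits depends on paid

# Toy chain 1 <- 2 <- 3 in which the middle token is not lexicalized.
print(is_weakly_connected({1, 3}, {1: 0, 2: 1, 3: 2}))  # False
```

Under this reading, a form in which the lexicalized components are linked only through a non-lexicalized word fails the check, which is why the test has to be applied to the neutral form.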

Probably the most salient property of MWEs is semantic non-compositionality. In other words, it is often impossible to straightforwardly deduce the meaning of the whole unit from the meanings of its parts and from its syntactic structure. For instance, while it is easy to interpret phrases like to kick the ball or to spill some water from the words that compose them, it is almost impossible to guess, without knowing it beforehand, that to kick the bucket means 'to die' and to spill the beans actually means 'to reveal a secret'.

However, as non-compositionality is a subjective notion and is hard to test directly, we use inflexibility as a proxy in the tests. Our underlying hypothesis is that MWEs have some degree of semantic non-compositionality that implies limited flexibility.

Depending on the distribution of its neutral form, a MWE can be verbal, nominal, adjectival, adpositional, etc.

Verbal MWEs

A verbal MWE (VMWE) is a multiword expression whose neutral form is such that: (i) it has a distribution of a verb, a verbal phrase or a verbal clause, (ii) its syntactic head is a verb.

  • she paid several visits to the president
  • pūst miglu acīs to blow mist into eyes to lie, to talk nonsense
  • władza czerpie z tego korzyści propagandowe the authorities draw propaganda benefits from this the authorities reap benefits from this for propaganda

Note that reasoning in terms of neutral forms is crucial here. A MWE may occur in a variant whose distribution is non-verbal. But when its neutral form is retrieved, the verbal distribution becomes apparent, and such a MWE is considered verbal.

  • the visits which she paid to the president - the distribution of this MWE is nominal but this is not a neutral form; when neutralized the verbal distribution and the verb headedness are restored
  • czerpane z tego korzyści propagandowe - the distribution of this MWE is nominal but this is not a neutral form; when neutralized the verbal distribution and the verb headedness are restored

Conversely, some MWEs derive from VMWEs but their neutral forms are not verbal. Such MWEs are considered deverbal nominal, adjectival or adverbial MWEs:

  • Wortbruch word-break a promise which has not been kept - nominal MWE deriving from ein Wort brechen
  • a take-off - nominal MWE deriving from to take off
    (a) run-down (apartment) - adjectival MWE deriving from to run down
  • la prise en compte the fact of taking into account - nominal MWE deriving from prendre en compte take into account
    une mise à disposition the fact of making available - nominal MWE deriving from mettre à disposition make available
  • zabawa czyimś kosztem a play at someone else's expense - nominal MWE derived from bawić się czyimś kosztem to enjoy oneself at someone else's expense

Some other MWEs contain verbs but are not derived from VMWEs and have a non-verbal distribution (nominal, adjectival, adverbial, etc.). These candidates are assigned the category which conforms with their distribution: nominal MWEs, modifier MWEs or functional MWEs.

  • Vergiss-mein-nicht forget-me-not forget-me-not - nominal MWE
  • forget-me-not - nominal MWE
  • peut-être may-be maybe - adverbial MWE
    porte-feuille carry-sheets wallet - nominal MWE
    couru d'avance run in advance forgone/predictable - adjectival MWE
  • pūt un palaid blow and let go frivolous, absent-minded - adjectival MWE
  • vergeet-mij-niet forget-me-not forget-me-not - nominal MWE
  • (zrobić coś za) Bóg-zapłać (do something for a) God-pay to do something for free - nominal MWE

Nominal MWEs

A nominal MWE (NMWE) is a multiword expression whose neutral form has a distribution of a noun.

  • I’ll have a hot dog for lunch.
    This was a real wild goose chase a foolish and hopeless search for or pursuit of something unattainable.
  • Leurs armes blanches sont en acier inoxydable Their white weapons are made of stainless steel Their bladed weapons are made of stainless steel
  • zili brīnumi blue wonder something unusual, surprising
  • Hij lust geen blinde vink He doesn't like 'blinde vink' He doesn't like blinde vink (Dutch meat)
  • Ostatnia transakcja okazała się dla firmy gwoździem do trumny The last transaction turned out for the company a nail to the coffin The last transaction turned out to be an event that caused the failure of the company
    W antykwariacie znalazła kilka białych kruków In the antique shop she found a few white ravens In the antique shop she found a few very rare books

It may or may not be headed by a noun:

  • Vergiss-mein-nicht forget-me-not
  • forget-me-not
  • porte-feuille carry-sheets wallet
  • vergeet-mij-niet forget-me-not forget-me-not
  • (zrobić coś za) Bóg-zapłać (do something for a) God-pay to do something for free

A major challenge in annotating NMWEs is to distinguish them from proper names and multiword terms. Proper names have a special semantic status because they function as names of entities rather than their descriptions. Proper names may contain MWEs and vice versa but most proper names do not pass the linguistic tests proposed here and thus we do not consider them MWEs. We defined specific tests (SPECIF-REF, NAMING-CONV and SEM-TYPE) to distinguish proper names from MWEs.

  • John Smith - entity name, not a MWE
    UN Secretary-General - entity name containing a NMWE
  • Agnieszka Kownacka - entity name, not a MWE
    Jego Królewska Mość Król Belgii His Royal Majesty the King of Belgium - entity name containing a NMWE

Multiword terms overlap with MWEs. Examples include:

  • white gold an alloy consisting of gold and platinum or nickel
  • pied d'athlète athlete's foot skin infection of the feet caused by a fungus
  • acs ābols apple of an eye eyeball
  • biały metal white metal alloy containing approximately 88% of tin
    rok świetlny light year a distance covered by a light ray in 1 year

But many multiword terms do not pass the inflexibility tests either, and we consider them semantically compositional (i.e. non-MWEs), as in:

  • bipolar disorder
  • affection respiratoire aiguë acute respiratory affection acute respiratory disease
  • ēšanas traucējumi eating disorders
  • obiektowy język programowania object programming language object-oriented programming language

Note that some MWEs whose internal structure is the one of a nominal phrase have a distribution of an adverb, an adposition or an adjective, etc. Those should not be annotated as NMWEs but as functional/adjectival/adverbial:

  • sailing head to wind sailing with the bow of the boat facing directly into the wind
  • Ils marchent main dans la main They are walking hand in hand They are walking holding each other's hand - adverbial MWE
    je fais ça toute seule, les doigts dans le nez I do it alone, fingers in my nose I do it easily
  • par mata tiesu - adverbial MWE
  • Zij lopen hand in hand They are walking hand in hand They are walking holding each other's hand - adverbial MWE
  • Wygrali tę wojnę psim swędem They won this war by a dog's stench/itch They won this war by a lot of luck - adverbial MWE

Recall that a MWE may be a multiword token. Deciding what is a word is notoriously difficult, especially in languages exhibiting frequent closed compounds, like Germanic languages. Closed compounds (i.e. compounds in which components are spelled together, possibly with some phonological changes on the border of morphemes) can be idiomatic:

  • Meer|schweinchen little sea pig guinea pig
  • passerby - inflects like a nominal phrase: passers|by
  • bonhomme good man fellow - inflects like a nominal phrase: bons|hommes
  • rzeczpospolita thing popular republic - inflects like a nominal phrase: rzeczy|pospolitej

or fully compositional:

  • Schul|jahr school year
  • school|jaar school year

or partly idiomatic and partly compositional:

We consider closed compounds as containing several words: we submit them to the PARSEME decision diagrams and annotate them as NMWEs if the tests are passed. We hypothesize that, most of the time, it is straightforward for annotators to identify word boundaries in a closed compound. If this is not the case, language-specific rules must be added. Splitting closed compounds directly in the corpus, if they are not split already, is not recommended, so as to keep the tokenization consistent with the underlying morpho-syntactic annotation.

See also the UniDive task on harmonizing the definition of a “syntactic word” across languages.

It happens that only part of a closed compound is idiomatic. For such cases, a UD/PARSEME white paper proposes subtoken spans, e.g.:

  • Hauptrolle spielen to play the main role - Rolle spielen to play a role is a VMWE, but the noun Rolle role can be freely modified, which yields a closed compound like Hauptrolle main role

This feature is not implemented yet. In the meantime, we suggest annotating the whole token as belonging to the MWE.

We consider that nominal MWEs embrace pronominal MWEs:

  • I saw just a few
    I expect no one to come
    we love each other
  • dažs labs few good somebody
  • powtarzał ciągle to samo he repeated always this the same he repeated always the same

Similarly to functional MWEs (below), pronominal MWEs constitute closed lists of cases, and their inflexibility is hard to test. They are also frequently ambiguous with idiomatic determiners.

  • I saw a few - a PronID
    I saw a few examples - a DetID
  • dažs labs jūtas svarīgs few good feels important somebody feels important - a PronMWE
    dažs labs šoferis jūtas svarīgs few good driver feels important some drivers feel important - a DetMWE
  • Ik gaf een paar voorbeelden I gave a few examples - a DetID
  • powtarzał ciągle to samo he repeated always this the same he repeated always the same - a PronMWE
    powtarzał ciągle to samo pytanie he repeated always this the same question he repeated always the same question - a DetMWE

Adjectival and adverbial MWEs

The class of adjectival and adverbial MWEs (AMWEs) includes adjectival idioms (AdjID) and adverbial idioms (AdvID). Those are multiword expressions whose neutral form has a distribution of an adjective or an adverb, respectively.

  • larger than life behaving in a way that is more exciting than other people to attract - AdjMWE
  • dzimis laimes krekliņā born in a shirt of luck lucky - an AdjMWE
    aiz restēm behind bars in prison - an AdvMWE
  • fris en fruitig raring to go - AdjMWE
  • urodzona w niedzielę born on Sunday lazy - an AdjMWE
    średnio na jeża averagely on a hedgehog not great - an AdvMWE

They do not have to be headed by adjectives or adverbs, as in:

  • the other way round - an AdvMWE headed by a noun
  • pūt un palaid blow and let go frivolous - an AdjMWE containing no adjectives
  • out of the box - an AdvMWE headed by a noun
  • na potęgę on power very much - an AdvMWE containing no adverb

Additionally, we cover AMWEs which derive from verbal MWEs but whose neutral form has an adjectival/adverbial distribution (see above) rather than a verbal one. The extent of such MWEs is as yet unknown.

Functional MWEs

A functional MWE (FuncMWE) is a multiword expression whose neutral form has a distribution of a function word. We consider four subcategories of FuncMWEs:

  • determiner idiom (DetID)
    • I work from home roughly every other day
    • zadałem sobie to samo pytanie I asked myself this same question I asked myself the same question
      przekaż mu te oto słowa transfer him these here words transfer him these words
  • adposition idiom (AdpID)
    • in front of the station
    • tas pats cilvēks that self person the same person
      katru otro dienu every second day every other day
    • op basis van based on
    • gwarancji nie ma nawet w przypadku arcymistrza there is no guarantee even in the case of a grandmaster
      co do pierwszego pytania what to the first question as to the first question
  • conjunction idiom (ConjID)
    • she was fortunate in that she had friends to help her
    • la cérémonie sera projetée sur grand écran afin que tout le monde puisse suivre the ceremony will be projected on a big screen so that everyone can follow
    • lai gan although
      neskatoties uz not looking at nevertheless
    • zmęczony mimo że dzień się dopiero zaczynał tired although that the day was only beginning tired although the day was only beginning
  • interjection idiom (IntjID)
    • damn it!
    • bon sang! good blood! damn it!
    • pie velna! at the devil! Damn it!
    • do diabła! To the devil! Damn it!

Functional MWEs constitute relatively short closed lists of cases. We recommend establishing such lists for each language and applying them consistently to corpus annotation (while paying attention to possible ambiguity), as in:

  • By the way, are you coming to Budapest?
    I recognized her by the way she was walking.
  • met betrekking tot regarding
  • co do pierwszego pytania what to the first question as to the first question
    rozumiesz co do ciebie mówię? do you understand what to you I say? do you understand what I'm telling you?
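
The recommendation above (a closed list per language, applied consistently, with attention to ambiguity) can be sketched as a simple lookup that proposes candidates for manual review. This is only an illustration under stated assumptions: the function `find_candidates`, the list contents and the pre-tokenized input are hypothetical, matching is limited to contiguous, case-insensitive token sequences, and a hit is a candidate rather than an annotation.

```python
def find_candidates(tokens, expressions):
    """Return (start, end, expression) spans where a listed expression occurs.

    tokens: list of word forms of one sentence.
    expressions: iterable of tuples of lower-cased word forms (the closed list).
    Every hit still needs a human decision, since the same sequence may also
    occur with a literal reading.
    """
    lowered = [token.lower() for token in tokens]
    hits = []
    for expression in expressions:
        length = len(expression)
        for start in range(len(lowered) - length + 1):
            if tuple(lowered[start:start + length]) == tuple(expression):
                hits.append((start, start + length, " ".join(expression)))
    return sorted(hits)


# Hypothetical closed list for English.
english_list = [("by", "the", "way"), ("in", "front", "of")]

idiomatic = ["By", "the", "way", ",", "are", "you", "coming", "to", "Budapest", "?"]
literal = ["I", "recognized", "her", "by", "the", "way", "she", "was", "walking", "."]

print(find_candidates(idiomatic, english_list))  # [(0, 3, 'by the way')]
print(find_candidates(literal, english_list))    # [(3, 6, 'by the way')]
```

Both sentences produce a candidate, but only the first is a functional MWE. The lookup therefore supports consistency, while the ambiguity check remains with the annotator.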

Of course, we still need criteria to decide which candidates should occur in such lists. But testing functional MWE candidates for non-compositionality is notoriously hard because they contain few content words (nouns, verbs, adjectives or adverbs) and have syntactic structures in which little flexibility is allowed, even in the absence of idiomaticity. The solution is to be consistent with the FuncMWE-specific decision diagram [add the link] (which is deterministic whenever the answers to atomic tests remain stable), even if it does not fully conform to our intuitions.

