PARSEME Shared Task 1.3 - Annotation guidelines

Annotation guidelines (version 2.0; UNDER CONSTRUCTION)
Used by the corpora annotated for multiword expressions

Multiword expressions

A multiword expression (MWE) is a (continuous or discontinuous) sequence of words with the following compulsory properties:

It contains at least two component words which are lexicalized, i.e. always realized by the same lexemes. Only these lexicalized components are annotated. For instance, in he paid several important visits to the president, we annotate only the components highlighted in bold.
Its neutral form forms a weakly connected graph, i.e., in its dependency graph, every (lexicalized) component is reachable from every other component if the directions of the dependencies are disregarded. For instance, in the following MWE the highlighted components do not form a weakly connected graph, but this form is not a neutral one. When it is transformed into a neutral form, the connectivity condition is fulfilled.
It shows some degree of orthographic, morphological, syntactic and/or semantic idiosyncrasy with respect to what are considered the general grammar rules of a language. This condition is tested by the decision diagrams documented in sections 5 to 9. Collocations, i.e. word co-occurrences whose idiosyncrasy is of a statistical nature only (e.g. the graphic shows, drastically drop), are not considered MWEs.
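
The connectivity condition in the second property can be illustrated with a minimal Python sketch. It is not part of the guidelines or of any PARSEME tool: the function name, the dictionary-based encoding of the dependency tree, and the token numbering are assumptions made for illustration only. The sketch also assumes that paths may pass only through lexicalized components, which is one reading of the condition.

```python
from collections import deque


def lexicalized_components_connected(heads, lexicalized):
    """Check that the lexicalized components form a weakly connected graph.

    heads: hypothetical encoding, dict mapping token id -> head id (0 = root).
    lexicalized: set of token ids of the lexicalized components.
    """
    nodes = set(lexicalized)
    if len(nodes) < 2:
        # An MWE needs at least two lexicalized components.
        return False
    # Undirected adjacency, restricted to edges between lexicalized components.
    adj = {n: set() for n in nodes}
    for dep, head in heads.items():
        if dep in nodes and head in nodes:
            adj[dep].add(head)
            adj[head].add(dep)
    # Breadth-first search from an arbitrary component.
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        for nxt in adj[cur] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen == nodes


# Assumed analysis of "she paid several visits to the president":
# 1 she, 2 paid, 3 several, 4 visits, 5 to, 6 the, 7 president
heads = {1: 2, 2: 0, 3: 4, 4: 2, 5: 7, 6: 7, 7: 2}
print(lexicalized_components_connected(heads, {2, 4}))  # paid, visits -> True
print(lexicalized_components_connected(heads, {1, 3}))  # she, several -> False
```

Such a check is meant to be applied to the neutral form of a candidate, since a non-neutral variant may fail it even when the expression is a valid MWE.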

Probably the most salient property of MWEs is semantic non-compositionality. In other words, it is often impossible to straightforwardly deduce the meaning of the whole unit from the meanings of its parts and from its syntactic structure. For instance, while it is easy to interpret phrases like to kick the ball or to spill some water from the words that compose them, it is almost impossible to guess, without knowing it beforehand, that to kick the bucket means 'to die' and to spill the beans actually means 'to reveal a secret'.

However, as non-compositionality is a subjective notion and is hard to test directly, we use inflexibility as a proxy in the tests. Our underlying hypothesis is that MWEs have some degree of semantic non-compositionality that implies limited flexibility.

Depending on the distribution of its neutral form, a MWE can be verbal, nominal, adjectival, adpositional, etc.

Verbal MWEs

A verbal MWE (VMWE) is a multiword expression whose neutral form is such that: (i) it has a distribution of a verb, a verbal phrase or a verbal clause, (ii) its syntactic head is a verb.

she paid several visits to the president

pūst miglu acīs to blow mist into eyes to lie, to talk nonsense

władza czerpie z tego korzyści propagandowe the authorities draw propaganda benefits from this the authorities reap benefits from this for propaganda

Note that reasoning in terms of neutral forms is crucial here. A MWE may occur in a variant whose distribution is non-verbal. But when its neutral form is retrieved, the verbal distribution becomes apparent, and such a MWE is considered verbal.

the visits which she paid to the president - the distribution of this MWE is nominal but this is not a neutral form; when neutralized the verbal distribution and the verb headedness are restored

czerpane z tego korzyści propagandowe - the distribution of this MWE is nominal but this is not a neutral form; when neutralized the verbal distribution and the verb headedness are restored

Conversely, some MWEs derive from VMWEs but their neutral forms are not verbal. Such MWEs are considered deverbal nominal, adjectival or adverbial MWEs:

Wortbruch word-break a promise which has not been kept - nominal MWE deriving from ein Wort brechen

a take-off - nominal MWE deriving from to take off
(a) run-down (apartment) - adjectival MWE deriving from to run down

la prise en compte the fact of taking into account - nominal MWE deriving from prendre en compte take into account
une mise à disposition the fact of making available - nominal MWE deriving from mettre à disposition make available

zabawa czyimś kosztem a play at someone else's expense - nominal MWE derived from bawić się czyimś kosztem to enjoy oneself at someone else's expense

Some other MWEs contain verbs but are not derived from VMWEs and have a non-verbal distribution (nominal, adjectival, adverbial, etc.). These candidates are assigned the category which conforms with their distribution: nominal MWEs, modifier MWEs or functional MWEs.

Vergiss-mein-nicht forget-me-not forget-me-not - nominal MWE

forget-me-not - nominal MWE

peut-être may-be maybe - adverbial MWE
porte-feuille carry-sheets wallet - nominal MWE
couru d'avance run in advance foregone/predictable - adjectival MWE

pūt un palaid blow and let go frivolous, absent-minded - adjectival MWE

vergeet-mij-niet forget-me-not forget-me-not - nominal MWE

(zrobić coś za) Bóg-zapłać (do something for a) God-pay to do something for free - nominal MWE

Nominal MWEs

A nominal MWE (NMWE) is a multiword expression whose neutral form has a distribution of a noun.

I’ll have a hot dog for lunch.
This was a real wild goose chase a foolish and hopeless search for or pursuit of something unattainable.

Leurs armes blanches sont en acier inoxydable Their white weapons are made of stainless steel Their bladed weapons are made of stainless steel

zili brīnumi blue wonder something unusual, surprising

Hij lust geen blinde vink He doesn't like 'blinde vink' He doesn't like blinde vink (Dutch meat)

Ostatnia transakcja okazała się dla firmy gwoździem do trumny The last transaction turned out for the company a nail to the coffin The last transaction turned out to be an event that caused the failure of the company
W antykwariacie znalazła kilka białych kruków In the antique shop she found a few white ravens In the antique shop she found a few very rare books

It may or may not be headed by a noun:

Vergiss-mein-nicht forget-me-not

forget-me-not

porte-feuille carry-sheets wallet

vergeet-mij-niet forget-me-not forget-me-not

(zrobić coś za) Bóg-zapłać (do something for a) God-pay to do something for free

A major challenge in annotating NMWEs is to distinguish them from proper names and multiword terms. Proper names have a special semantic status because they function as names of entities rather than their descriptions. Proper names may contain MWEs and vice versa but most proper names do not pass the linguistic tests proposed here and thus we do not consider them MWEs. We defined specific tests (SPECIF-REF, NAMING-CONV and SEM-TYPE) to distinguish proper names from MWEs.

John Smith - entity name, not a MWE
UN Secretary-General - entity name containing a NMWE

Agnieszka Kownacka - entity name, not a MWE
Jego Królewska Mość Król Belgii His Royal Majesty the King of Belgium - entity name containing a NMWE

Multiword terms overlap with MWEs. Examples include:

white gold an alloy consisting of gold and platinum or nickel

pied d'athlète athlete's foot skin infection of the feet caused by a fungus

acs ābols apple of an eye eyeball

biały metal white metal alloy containing approximately 88% of tin
rok świetlny light year a distance covered by a light ray in 1 year

But many multiword terms do not pass the inflexibility tests either, and we consider them semantically compositional (i.e. non-MWEs), as in:

bipolar disorder

affection respiratoire aiguë acute respiratory affection acute respiratory disease

ēšanas traucējumi eating disorders

obiektowy język programowania object programming language object-oriented programming language

Note that some MWEs whose internal structure is that of a nominal phrase have the distribution of an adverb, an adposition, an adjective, etc. Those should not be annotated as NMWEs but as functional/adjectival/adverbial MWEs:

sailing head to wind sailing with the bow of the boat facing directly into the wind

Ils marchent main dans la main They are walking hand in hand They are walking holding each other's hand - adverbial MWE
je fais ça toute seule, les doigts dans le nez I do it alone, fingers in my nose I do it easily

par mata tiesu - adverbial MWE

Zij lopen hand in hand They are walking hand in hand They are walking holding each other's hand - adverbial MWE

Wygrali tę wojnę psim swędem They won this war by a dog's stench/itch They won this war by a lot of luck - adverbial MWE

Recall that a MWE may be a multiword token. Deciding what is a word is notoriously difficult, especially in languages exhibiting frequent closed compounds, like Germanic languages. Closed compounds (i.e. compounds in which components are spelled together, possibly with some phonological changes on the border of morphemes) can be idiomatic:

Meer|schweinchen little sea pig guinea pig

passerby - inflects like a nominal phrase: passers|by

bonhomme good man fellow - inflects like a nominal phrase: bons|hommes

rzeczpospolita thing popular republic - inflects like a nominal phrase: rzeczy|pospolitej

or fully compositional:

Schul|jahr school year

school|jaar school year

or partly idiomatic and partly compositional:

We treat closed compounds as containing several words: we submit them to the PARSEME decision diagrams and annotate them as NMWEs if the tests are passed. We hypothesize that, most of the time, it is straightforward for annotators to identify word boundaries in a closed compound. If this is not the case, language-specific rules must be added. Splitting closed compounds directly in the corpus, if they are not split already, is not recommended, so as to keep the tokenization consistent with the underlying morpho-syntactic annotation.

It happens that only part of a closed compound is idiomatic. For such cases, a UD/PARSEME white paper proposes subtoken spans, e.g.:

Hauptrolle spielen to play the main role - Rolle spielen to play a role is a VMWE, but the noun Rolle role can be freely modified, which yields a closed compound like Hauptrolle main role

This feature is not implemented yet. In the meantime, we suggest annotating the whole token as belonging to the MWE.

We consider that nominal MWEs embrace pronominal MWEs:

I saw just a few
I expect no one to come
we love each other

dažs labs few good somebody

powtarzał ciągle to samo he repeated always this the same he always repeated the same thing

Similarly to functional MWEs (below), pronominal MWEs constitute closed lists of cases, and their inflexibility is hard to test. They are also frequently ambiguous with idiomatic determiners.

I saw a few - a PronID
I saw a few examples - a DetID

dažs labs jūtas svarīgs few good feels important somebody feels important - a PronMWE
dažs labs šoferis jūtas svarīgs few good driver feels important some drivers feel important - a DetMWE

Ik gaf een paar voorbeelden I gave a few examples - a DetID

powtarzał ciągle to samo he repeated always this the same he always repeated the same thing - a PronMWE
powtarzał ciągle to samo pytanie he repeated always this the same question he always repeated the same question - a DetMWE

Adjectival and adverbial MWEs

The class of adjectival and adverbial MWEs (AMWEs) includes adjectival idioms (AdjID) and adverbial idioms (AdvID). Those are multiword expressions whose neutral form has the distribution of an adjective or an adverb, respectively.

larger than life behaving in a way that is more exciting than other people to attract - AdjMWE

dzimis laimes krekliņā born in a shirt of luck lucky - an AdjMWE
aiz restēm behind bars in prison - an AdvMWE

fris en fruitig raring to go - AdjMWE

urodzona w niedzielę born on Sunday lazy - an AdjMWE
średnio na jeża averagely on a hedgehog not great - an AdvMWE

They do not have to be headed by adjectives or adverbs, as in:

the other way round - an AdvMWE headed by a noun

pūt un palaid blow and let go frivolous - an AdjMWE containing no adjectives

out of the box - an AdvMWE headed by a noun

na potęgę on power very much - an AdvMWE containing no adverb

Additionally, we cover AMWEs which derive from verbal MWEs but whose neutral form has an adjectival/adverbial distribution (see above) rather than a verbal one. The extent of such MWEs is as yet unknown.

Functional MWEs

A functional MWE (FuncMWE) is a multiword expression whose neutral form has a distribution of a function word. We consider four subcategories of FuncMWEs:

determiner idiom (DetID)

I work from home roughly every other day

tas pats cilvēks that self person the same person
katru otro dienu every second day every other day

zadałem sobie to samo pytanie I asked myself this same question I asked myself the same question
przekaż mu te oto słowa transfer him these here words transfer him these words

adposition idiom (AdpID)

in front of the station

op basis van based on

gwarancji nie ma nawet w przypadku arcymistrza there is no guarantee even in the case of a grandmaster
co do pierwszego pytania what to the first question as to the first question

conjunction idiom (ConjID)

she was fortunate in that she had friends to help her

la cérémonie sera projetée sur grand écran afin que tout le monde puisse suivre the ceremony will be projected on a big screen so that everyone can follow

lai gan although
neskatoties uz not looking at nevertheless

zmęczony mimo że dzień się dopiero zaczynał tired although that the day was only beginning tired although the day was only beginning

interjection idiom (IntjID)

damn it!

bon sang! good blood! damn it!

pie velna! at the devil! Damn it!

do diabła! To the devil! Damn it!

Functional MWEs constitute relatively short closed lists of cases. We recommend establishing such lists for each language and applying them consistently to corpus annotation (while paying attention to possible ambiguity), as in:

By the way, are you coming to Budapest?
I recognized her by the way she was walking.

met betrekking tot regarding

co do pierwszego pytania what to the first question as to the first question
rozumiesz co do ciebie mówię? do you understand what to you I say? do you understand what I'm telling you?
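
The recommendation to apply a closed list while watching for ambiguity can be sketched as a simple lookup that only proposes candidates and leaves the final decision to the annotator. This is a minimal illustration, not a PARSEME tool: the function name, the lexicon format (tuples of lowercased tokens) and the whitespace tokenization are all assumptions.

```python
def find_candidates(tokens, lexicon):
    """Return (start, end, entry) spans where a lexicon entry occurs contiguously.

    tokens: list of word forms of one sentence.
    lexicon: hypothetical closed list, a set of tuples of lowercased tokens.
    Every hit is only a candidate; an annotator still has to confirm it.
    """
    lowered = [t.lower() for t in tokens]
    hits = []
    for entry in lexicon:
        n = len(entry)
        for i in range(len(lowered) - n + 1):
            if tuple(lowered[i:i + n]) == entry:
                hits.append((i, i + n, " ".join(entry)))
    return sorted(hits)


lexicon = {("by", "the", "way"), ("in", "front", "of")}

idiomatic = "By the way , are you coming to Budapest ?".split()
literal = "I recognized her by the way she was walking .".split()

print(find_candidates(idiomatic, lexicon))  # [(0, 3, 'by the way')]
print(find_candidates(literal, lexicon))    # [(3, 6, 'by the way')]
```

Both sentences yield a candidate, but only the first is an idiomatic reading, so a purely list-based match cannot replace manual disambiguation.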

Of course, we still need criteria to decide which candidates should occur in such lists. But testing functional MWE candidates for non-compositionality is notoriously hard because they contain few content words (nouns, verbs, adjectives or adverbs) and have syntactic structures in which little flexibility is allowed, even in the absence of idiomaticity. The solution is to be consistent with the FuncMWE-specific decision diagram [add the link] (which is deterministic, whenever the answers to atomic tests remain stable), even if it does not fully conform to our intuitions.
