Annotation guidelines (version 2.0; UNDER CONSTRUCTION)
Used by the
corpora annotated for multiword expressions
Multiword expressions
A multiword expression (MWE) is a (continuous or discontinuous) sequence of words with the following compulsory properties:
- It contains at least two component words which are lexicalized, i.e. always realized by the same lexemes. Only these lexicalized components are annotated. For instance, in he paid several important visits to the president, we annotate only paid and visits.
- Its neutral form forms a weakly connected graph, i.e., in its dependency graph, every (lexicalized) component is reachable from every other component if the directions of the dependencies are disregarded. For instance, in the following MWE
the highlighted components do not form a weakly connected graph, but this form is not a neutral one. When it is transformed to a neutral form
the connectivity condition is fulfilled.
- It shows some degree of orthographic, morphological, syntactic and/or semantic idiosyncrasy with respect to what are considered the general grammar rules of a language. This condition is tested by the decision diagrams documented in sections 5 to 9. Collocations, i.e. word co-occurrences whose idiosyncrasy is of a statistical nature only (e.g. the graphic shows, drastically drop), are not considered MWEs.
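The connectivity condition in the second property can be checked mechanically. The following is a minimal sketch, not part of any official tooling. It assumes that the dependency tree is available as a hypothetical `heads` mapping from token index to head index (0 for the root), and that the condition is meant to hold on the graph restricted to the lexicalized components. The head indices in the usage lines are hand-written for illustration only.

```python
from collections import defaultdict


def is_weakly_connected(heads, lexicalized):
    """Check that the lexicalized tokens form a weakly connected graph.

    heads: dict mapping token index -> head index (0 = root);
           a hypothetical input format chosen for this sketch.
    lexicalized: token indices of the lexicalized components.
    """
    lexicalized = set(lexicalized)
    if not lexicalized:
        return False
    # Keep only arcs whose two ends are both lexicalized, ignoring direction.
    neighbours = defaultdict(set)
    for dependent, head in heads.items():
        if dependent in lexicalized and head in lexicalized:
            neighbours[dependent].add(head)
            neighbours[head].add(dependent)
    # Depth-first traversal from an arbitrary lexicalized component.
    start = next(iter(lexicalized))
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for other in neighbours[node] - seen:
            seen.add(other)
            stack.append(other)
    return seen == lexicalized


# he(1) paid(2) several(3) important(4) visits(5) to(6) the(7) president(8)
# Assumed analysis: "visits" depends directly on "paid".
heads_direct = {1: 2, 2: 0, 3: 5, 4: 5, 5: 2, 6: 8, 7: 8, 8: 2}
print(is_weakly_connected(heads_direct, {2, 5}))

# Hypothetical tree in which token 5 attaches to the non-lexicalized token 3.
heads_indirect = {1: 2, 2: 0, 3: 2, 4: 5, 5: 3}
print(is_weakly_connected(heads_indirect, {2, 5}))
```

Under these assumptions the first call prints True and the second prints False, because in the second tree the only path between tokens 2 and 5 passes through a component that is not lexicalized.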
Probably the most salient property of MWEs is semantic non-compositionality. In other words, it is often impossible to straightforwardly deduce the meaning of the whole unit from the meanings of its parts and from its syntactic structure. For instance, while it is easy to interpret phrases like to kick the ball or to spill some water from the words that compose them, it is almost impossible to guess, without knowing it beforehand, that
However, as non-compositionality is a subjective notion and is hard to test directly, we use inflexibility as a proxy in the tests. Our underlying hypothesis is that MWEs have some degree of semantic non-compositionality that implies limited flexibility.
Depending on the distribution of its neutral form, a MWE can be verbal, nominal, adjectival, adpositional, etc.
Verbal MWEs
A verbal MWE (VMWE) is a multiword expression whose neutral form is such that: (i) it has a distribution of a verb, a verbal phrase or a verbal clause, (ii) its syntactic head is a verb.
Note that reasoning in terms of neutral forms is crucial here. A MWE may occur in a variant whose distribution is non-verbal. But when its neutral form is retrieved, the verbal distribution becomes apparent, and such a MWE is considered verbal.
Conversely, some MWEs derive from VMWEs but their neutral forms are not verbal. Such MWEs are considered deverbal nominal, adjectival or adverbial MWEs:
(a) run-down (apartment) - adjectival MWE deriving from to run down
une mise à disposition the fact of making available - nominal MWE deriving from mettre à disposition make available
porte-feuille carry-sheets wallet - nominal MWE
couru d'avance run in advance foregone/predictable - adjectival MWE
Nominal MWEs
A nominal MWE (NMWE) is a multiword expression whose neutral form has a distribution of a noun.
This was a real wild goose chase (a foolish and hopeless search for or pursuit of something unattainable).
W antykwariacie znalazła kilka białych kruków In the antique shop she found a few white ravens In the antique shop she found a few very rare books
It may or may not be headed by a noun:
A major challenge in annotating NMWEs is to distinguish them from proper names and multiword terms. Proper names have a special semantic status because they function as names of entities rather than their descriptions. Proper names may contain MWEs and vice versa but most proper names do not pass the linguistic tests proposed here and thus we do
UN Secretary-General - entity name containing a NMWE
Jego Królewska Mość Król Belgii His Royal Majesty the King of Belgium - entity name containing a NMWE
Multiword terms overlap with MWEs. Examples include:
rok świetlny light year a distance covered by a light ray in 1 year
But many multiword terms do not pass the inflexibility tests either, and we consider them semantically compositional (i.e. non-MWEs), as in:
Note that some MWEs whose internal structure is that of a nominal phrase have a distribution of an adverb, an adposition, an adjective, etc. Those should not be annotated as NMWEs but as functional/adjectival/adverbial MWEs:
je fais ça toute seule, les doigts dans le nez I do it alone, fingers in my nose I do it easily
Recall that a MWE may be a multiword token. Deciding what is a word is notoriously difficult, especially in languages exhibiting frequent closed compounds, like Germanic languages. Closed compounds (i.e. compounds in which components are spelled together, possibly with some phonological changes on the border of morphemes) can be idiomatic:
or fully compositional:
or partly idiomatic and partly compositional:
We consider closed compounds as containing several words, submit them to the PARSEME decision diagrams, and annotate them as NMWEs if the tests are passed. We hypothesize that, most of the time, it is straightforward for annotators to identify word boundaries in a closed compound. If this is not the case, language-specific rules must be added. Splitting closed compounds directly in the corpus, if they are not split already, is not recommended, so as to keep the tokenization consistent with the underlying morpho-syntactic annotation.
See also the UniDive task on harmonizing the definition of a “syntactic word” across languages.
It happens that only part of a closed compound is idiomatic. For such cases, a UD/PARSEME white paper proposes subtoken spans, e.g.:
This feature is not implemented yet. In the meantime, we suggest annotating the whole token as belonging to the MWE.
We consider that nominal MWEs embrace pronominal MWEs:
I expect no one to come
we love each other
Similarly to functional MWEs (below), pronominal MWEs constitute closed lists of cases, and their inflexibility is hard to test. They are also frequently ambiguous with idiomatic determiners.
I saw a few examples - a DetID
dažs labs šoferis jūtas svarīgs few good driver feels important some drivers feel important - a DetMWE
powtarzał ciągle to samo pytanie he repeated always this the same question he kept repeating the same question - a DetMWE
Adjectival and adverbial MWEs
The class of adjectival and adverbial MWEs (AMWEs) includes adjectival idioms (AdjID) and adverbial idioms (AdvID). Those are multiword expressions whose neutral form has a distribution of an adjective or an adverb, respectively.
aiz restēm behind bars in prison - an AdvMWE
średnio na jeża averagely on a hedgehog not great - an AdvMWE
They do not have to be headed by adjectives or adverbs, as in:
Additionally, we cover AMWEs which derive from verbal MWEs but whose neutral form has an adjectival/adverbial distribution (see above) rather than a verbal one. The extent of such MWEs is as yet unknown.
Functional MWEs
A functional MWE (FuncMWE) is a multiword expression whose neutral form has a distribution of a function word. We consider four subcategories of FuncMWEs:
- determiner idiom (DetID)
- adposition idiom (AdpID)
- conjunction idiom (ConjID)
- interjection idiom (IntjID)
przekaż mu te oto słowa transfer him these here words transfer him these words
katru otro dienu every second day every other day
co do pierwszego pytania what to the first question as to the first question
neskatoties uz not looking at nevertheless
Functional MWEs constitute relatively short closed lists of cases. We recommend establishing such lists for each language and applying them consistently to corpus annotation, while paying attention to possible ambiguity, as in:
I recognized her by the way she was walking.
rozumiesz co do ciebie mówię? do you understand what to you I say? do you understand what I'm telling you?
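Applying a closed list to a corpus can be sketched as a simple candidate search. This is an illustrative sketch only: the function name `find_candidates`, the list entries and the whitespace tokenization are assumptions made for this example, not part of the guidelines. Every hit is only a candidate, since the same word sequence may occur with a literal reading, as in the two sentences above.

```python
def find_candidates(tokens, closed_list):
    """Return (start, end, entry) for each contiguous match of a listed sequence.

    tokens: list of word forms of one sentence.
    closed_list: iterable of lower-cased word tuples (hypothetical lexicon format).
    The matches are candidates for manual review, not final annotations.
    """
    lowered = [token.lower() for token in tokens]
    hits = []
    for entry in closed_list:
        size = len(entry)
        for start in range(len(lowered) - size + 1):
            if tuple(lowered[start:start + size]) == tuple(entry):
                hits.append((start, start + size, tuple(entry)))
    return sorted(hits)


# Illustrative entries only; a real list would be established per language.
closed_list = [("co", "do"), ("by", "the", "way")]

sentence_pl = "rozumiesz co do ciebie mówię ?".split()
sentence_en = "I recognized her by the way she was walking .".split()

for sentence in (sentence_pl, sentence_en):
    for start, end, entry in find_candidates(sentence, closed_list):
        # Both hits here are literal readings that an annotator should reject.
        print(start, end, " ".join(entry), "-> needs manual check")
```

A purely form-based lookup of this kind cannot resolve the ambiguity by itself, which is why the hits are reported for manual checking rather than annotated automatically.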
Of course, we still need criteria to decide which candidates should occur in such lists. But testing functional MWE candidates for non-compositionality is notoriously hard because they contain few content words (nouns, verbs, adjectives or adverbs) and have syntactic structures in which little flexibility is allowed, even in the absence of idiomaticity. The solution is to be consistent with the FuncMWE-specific decision diagram [add the link] (which is deterministic, as long as the answers to atomic tests remain stable), even if it does not fully conform to our intuitions.