Annotation guidelines
PARSEME shared task on automatic identification of verbal MWEs - edition 1.0 (2017)


Frequently Asked Questions (FAQ)

Annotators often face questions and challenging examples. When several annotators ask the same question, we will update the list of frequently asked questions.

However, we suggest that each language team set up a separate communication channel for language-specific questions. This can be a shared online document, a wiki, a dedicated bug tracking system or a mailing list. We also suggest keeping track of decisions taken on borderline examples (with a list of the expressions to which each decision applies). These should be kept in a centralized document or page that all annotators can access.

Whenever you think that a question may also be relevant to other languages, please notify the organizers and we will try to update this page.

  1. How to define an unexpected change in meaning?
  2. How to annotate lexicalized words which belong to contractions and compounds?
  3. How to annotate coordinated VMWEs sharing some components?
  4. How to annotate elliptical occurrences of VMWEs?
  5. How to annotate VMWEs that seem to belong to more than one category?
  6. How to annotate embedded VMWEs?
  7. Are existential expressions with there is/are considered VMWEs?
  8. How to categorize VMWEs which seem to be LVCs but do not pass all LVC tests?
  9. Why are verb+noun constructions with pure operator verbs (to commit, to make, to have etc.) considered LVCs?
  10. Does the IReflV category include verbs with non-reflexive clitics?
  11. Should nominalizations of VMWEs be annotated?
  12. How to express hesitation between different VMWE categories?
  13. In test 9, how can one decide whether an abstract noun is an event or a state?
  14. How does one decide if a more or less frozen determiner is a lexicalized VMWE component?
  15. Should I annotate compound and serial verbs as VMWEs? Of which category?
  16. If an LVC contains a complex (fixed) NP as a dependent, should I include the whole NP or just the head?

1. How to define an unexpected change in meaning?

See the glossary entry that defines unexpected change in meaning.

2. How to annotate lexicalized words which belong to contractions and compounds?

In some languages, prepositions, clitics and determiners are subject to contractions (i.e. they yield multiword tokens, MWTs). Tokenizers might not split contractions properly. In this case, a lexicalized component of a VMWE can be merged with an external word:

  • haberse suicidado (lit. have+REFL suicided) "committed suicide"

A similar problem occurs in languages with productive compounding, where a lexicalized component of a VMWE and a free modifier can form a single multiword token (since compound splitting might not be a standard feature of a tokenizer):

  • unter Drogeneinfluss stehen "to be under the influence of drugs"
    Heisshunger haben (lit. to have hot hunger) "to be ravenously hungry"

Since the current annotation format is token-based, annotators must not correct tokenization errors or split compounds, for the sake of consistency. As a result, our schema offers no fully satisfactory solution for annotating such contractions and compounds. We propose to annotate the whole MWT each time it contains a word which is part of a VMWE. Annotators should add a textual comment about the mixed status of this MWT:

  • Drogeneinfluss → MWT containing a lexicalized VMWE component Einfluss and an external word Drogen
    Heisshunger → MWT containing a lexicalized VMWE component Hunger and an additional modifier heiss
  • haberse → MWT containing a lexicalized VMWE component se and an external word haber
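
The proposal above can be sketched in a few lines of Python. This is only an illustration: the `Token` class, its fields and the `annotate_mwt` helper are hypothetical names, not part of the shared task format or tools. The whole MWT is marked as a VMWE component, and its mixed status is recorded in a comment.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    form: str
    vmwe_ids: set = field(default_factory=set)  # VMWEs this token belongs to
    comment: str = ""                           # free-text annotator comment


def annotate_mwt(token, vmwe_id, lexicalized, external):
    """Annotate the whole MWT and record its mixed status in a comment."""
    token.vmwe_ids.add(vmwe_id)
    token.comment = (
        f"MWT containing a lexicalized VMWE component {lexicalized} "
        f"and an external word {external}"
    )
    return token


# "haberse suicidado": the tokenizer left the contraction "haberse" unsplit.
tokens = [Token("haberse"), Token("suicidado")]
annotate_mwt(tokens[0], vmwe_id=1, lexicalized="se", external="haber")
tokens[1].vmwe_ids.add(1)

for t in tokens:
    print(t.form, sorted(t.vmwe_ids), t.comment)
```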

3. How to annotate coordinated VMWEs sharing some components?

A component shared by two or more coordinated VMWEs should be annotated as belonging to both of them.

  • Regeln und Richtlinien aufstellen (lit. to set up rules and guidelines) "to draw up rules and guidelines": aufstellen must be annotated both as part of Regeln aufstellen "to lay down rules" and of Richtlinien aufstellen "to draw up guidelines"
  • to have a walk or a ride: have must be annotated both as part of to have a walk and of to have a ride
  • odprawić mszę i pokutę "celebrate a mass and a penance": odprawić should be annotated both as part of odprawić mszę "to celebrate a mass" and of odprawić pokutę "to celebrate a penance"
  • imeti dober želodec in dobre živce (lit. to have a good stomach and good nerves) "to bear something well and to be mentally strong": imeti "have" must be annotated both as part of imeti dober želodec and of imeti dobre živce
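
In a token-based representation, sharing a component simply means that one token carries the identifiers of several VMWEs. The sketch below illustrates this for to have a walk or a ride. The function name and the dictionary layout are assumptions made for this example, and the choice of which tokens count as lexicalized (the verb and the two nouns) is illustrative only.

```python
from collections import defaultdict


def tokens_to_vmwes(n_tokens, vmwes):
    """Map each token position to the sorted ids of the VMWEs containing it.

    vmwes: dict mapping a VMWE id to the list of its token positions.
    """
    membership = defaultdict(set)
    for vmwe_id, positions in vmwes.items():
        for pos in positions:
            membership[pos].add(vmwe_id)
    return [sorted(membership[pos]) for pos in range(n_tokens)]


tokens = ["to", "have", "a", "walk", "or", "a", "ride"]
vmwes = {
    1: [1, 3],  # have ... walk
    2: [1, 6],  # have ... ride
}
per_token = tokens_to_vmwes(len(tokens), vmwes)

# Tokens that belong to more than one VMWE are the shared components.
shared = [tokens[i] for i, ids in enumerate(per_token) if len(ids) > 1]
print(shared)
```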

4. How to annotate elliptical occurrences of VMWEs?

Instances of a VMWE in which all but one lexicalized component were omitted or pronominalized should not be annotated. This concerns in particular the cases where a nominal component is replaced by an anaphoric pronoun. For instance, in "this decision was hard but he took it", we should not annotate take and decision, or take and it, as an instance of a VMWE. We annotate only the transformations in which the syntactic dependency link between the head verb and the lexicalized complement is preserved, e.g. "the decision which he took".

5. How to annotate VMWEs that seem to belong to more than one category?

Such hesitation should normally be resolved by decision trees 1 and 2. For instance, consider the German expression sich eine Frage stellen (lit. SELF a question put) "to doubt". It may seem to belong both to IReflV, since sich is required only if stellen co-occurs with Frage, and to LVC, since Frage keeps its original meaning and stellen brings no additional meaning. However, test 7 [1DEP] indicates that an expression like this should be annotated as ID, since the verb has more than one lexicalized syntactic dependent.

Similarly, the French expression avoir peur (lit. have fear) "to be afraid" seems to have features of an ID. Unlike most LVCs, it does not allow a determiner (*avoir une peur, lit. have a fear), except when the noun is modified (avoir une grande peur, lit. have a great fear). However, test 8 [CATEG] in decision tree 2 and the LVC-specific decision tree indicate that it belongs to the LVC category.

6. How to annotate embedded VMWEs?

Candidate VMWEs embedded in other VMWEs should be annotated only if they also have VMWE status outside that particular context. For instance, the VMWE to let the cat out of the bag should be annotated as ID, and its embedded VMWE to let out as a VPC.

On the other hand, in the French expression se faire des idées (lit. SELF make DET.PL ideas) "to imagine things which are not true", se faire should not be annotated as IReflV, since it is not inherently reflexive as a standalone verb+clitic combination.

7. Are existential expressions with there is/are considered VMWEs?

Hesitation about a possible LVC status can arise for existential constructions with nouns denoting events or properties (see test 9 [N-EVENT]), as in:

  • es gibt Beschwerden "there are complaints"
  • there are complaints
  • il y a des plaintes (lit. it there has complaints) "there are complaints"
  • há queixas (lit. has complaints) "there are complaints"
  • imeti pripombe (lit. have complaints) "there are complaints"

In these constructions, the noun keeps its original sense and the existential verb to be or to have brings no additional meaning. However, a candidate LVC must also pass test 12 [V-REDUC]. This test requires the noun to be modified by the verb's subject, which is impossible with impersonal and empty subjects like there. Therefore, such candidates cannot be LVCs.

Note, however, that existential expressions themselves can be VMWEs of type ID. For instance, in the French example il y a des plaintes (lit. it there has complaints) "there are complaints", two dependents of the verb a "has" are lexicalized: il "it" and y "there". It is therefore an ID (see test 7 [1DEP]).

8. How to categorize VMWEs which seem to be LVCs but do not pass all LVC tests?

If at least one of the five LVC tests (9 to 13) is not passed, the candidate is not considered an LVC. For the sake of a deterministic VMWE categorization and higher inter-annotator agreement, we adopt a definition of LVC that may seem more restrictive than the one usually assumed in linguistic studies. Thus, we exclude from the LVC scope:

  • expressions in which the verb's syntactic subject is not necessarily the noun's semantic subject, like to give courage or to make an impression. These candidates do not pass test 12 [V-REDUC].
  • expressions where the lexicalized nominal dependent of the verb is its subject, as in the problem lies in something. These candidates do not pass test 12 [V-REDUC].
  • expressions with aspectual verbs, as in to start, to pursue, to stop a walk. These do not pass test 11 [V-LIGHT] since they add (aspectual) semantics to the noun. The only exception is when the noun itself is already aspectual, as in to come into bloom.
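
The rule stated above is a plain conjunction: a candidate is an LVC only if all of tests 9 to 13 are passed. A minimal sketch follows, assuming the annotator's verdicts are stored in a dictionary keyed by test number. The function name and the data layout are hypothetical, and the verdicts for tests not discussed in the text are set to True purely for illustration.

```python
LVC_TESTS = (9, 10, 11, 12, 13)


def is_lvc(results):
    """Return True only if every LVC test (9 to 13) is passed.

    results: dict mapping a test number to the annotator's True/False verdict.
    """
    missing = [t for t in LVC_TESTS if t not in results]
    if missing:
        raise ValueError(f"no verdict for tests: {missing}")
    return all(results[t] for t in LVC_TESTS)


# "to make an impression" fails test 12 [V-REDUC].
make_an_impression = {9: True, 10: True, 11: True, 12: False, 13: True}
# "to start a walk" fails test 11 [V-LIGHT].
start_a_walk = {9: True, 10: True, 11: False, 12: True, 13: True}
# A candidate that passes every test.
all_passed = {t: True for t in LVC_TESTS}

print(is_lvc(make_an_impression), is_lvc(start_a_walk), is_lvc(all_passed))
```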

9. Why are verb+noun constructions with pure operator verbs (to commit, to make, to have etc.) considered LVCs?

Pure operator verbs, i.e. verbs that never have any semantics of their own but only carry grammatical information (tense, mood etc.), seem to contradict the intuition behind a VMWE, because they usually select a whole semantic class of nouns. For instance, to commit selects any negative act (a crime, a suicide, a theft) and to perform selects any activity (a task, an experiment, a miracle). In this sense, their complements resemble open slots and the whole combinations resemble collocations. However, for the sake of a deterministic VMWE categorization and higher inter-annotator agreement, we do include verb+noun combinations with pure operator verbs, such as to commit a crime and to perform a task, in the LVC category. This is because such combinations pass all five LVC-specific tests (9 through 13).

We could have organized decision tree 1 differently and excluded such cases from the VMWE scope by eliminating the LVC hypothesis. Then, to commit a crime and to perform a task would pass none of tests 1 to 5 and would be eliminated. However, we would also have to eliminate prototypical LVCs like to make a decision (it passes none of tests 1 to 5 either), which we do wish to keep as an LVC.

10. Does the IReflV category include verbs with non-reflexive clitics?

No, the IReflV category only includes (some) combinations of a head verb with a reflexive clitic. As indicated in the borderline cases page of the IReflV category, other pronouns, whenever lexicalized, trigger the ID category. Recall that whenever more than one dependent of the verb is lexicalized (whether or not a reflexive clitic is among them), the VMWE is always categorized as an ID:

  • sich Fragen stellen (lit. SELF questions put) "to doubt"
  • s'en aller (lit. SELF of-there go) "to leave"
  • ucvreti jo (lit. to escape her) "to escape something/someone by running"

11. Should nominalizations of VMWEs be annotated?

The only nominal VMWE variants within our annotation scope are those:

  • headed by the gerund stemming from the head verb of the VMWE - taking of the decision, and
  • in which a noun stemming from a VMWE is modified by a participle or a relative clause headed by the verb stemming from the same VMWE - the decisions taken yesterday, the decision which he took.

Other nominalizations are excluded:

  • Wortbruch (lit. word-break) "a promise that has not been kept"
  • a break-down, a forget-me-not
  • la prise en compte (lit. the taking into account) "the fact of taking something into account", peut-être (lit. may-be) "maybe", porte-feuilles (lit. carry-sheets) "wallet"
  • zabawa czyimś kosztem "a play at someone else's expense", derived from bawić się czyimś kosztem "to enjoy oneself at someone else's expense"
  • šala na tuj račun "a joke at someone else's expense", derived from šaliti se na tuj račun "to play a joke on someone"

For practical reasons (e.g. compatibility with an existing annotation, or usefulness for a particular application), such nominalizations can be considered language-specific VMWEs, but a new category should then be defined for them, so as to keep the universal and quasi-universal categories intact.

12. How to express hesitation between different VMWE categories?

Once identified in a text, each VMWE is to be assigned to exactly one category. Note that in this version of the guidelines we no longer admit "hesitation labels" (e.g. LVC/ID) used in the pilot annotation. Hesitation can, however, be expressed in a comment and a particular value of the annotator's confidence assigned to a particular VMWE occurrence.
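
A possible way to enforce this in an annotation tool is sketched below. Everything here is an assumption made for illustration: the class name, the field names, the 0-to-1 confidence scale, and the category set, which lists only the categories named in this FAQ. The point is that a combined label such as LVC/ID is rejected, while hesitation travels in the comment and the confidence value.

```python
from dataclasses import dataclass

# Only the categories mentioned in this FAQ; the full inventory may be larger.
CATEGORIES = {"ID", "LVC", "IReflV", "VPC"}


@dataclass
class VMWEAnnotation:
    tokens: tuple
    category: str             # exactly one category per VMWE occurrence
    confidence: float = 1.0   # hypothetical scale from 0.0 to 1.0
    comment: str = ""

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"not a single known category: {self.category!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie between 0.0 and 1.0")


annotation = VMWEAnnotation(
    tokens=("avoir", "peur"),
    category="LVC",
    confidence=0.6,
    comment="hesitation between LVC and ID",
)
print(annotation)
```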

13. In test 9, how can one decide whether an abstract noun is an event or a state?

The goal of test 9 is to identify whether a noun is predicative, that is, whether it requires some semantic arguments. We talk about events and states to circumvent the question of whether a noun is predicative. Here, they are understood very broadly, as roughly corresponding to binary and unary predicates. For instance, we consider that an event is something that happens, and can be related to an action, activity, process or phenomenon. A state is understood as a property that may or may not change over time, including feelings, sensations, permanent and temporary properties and relations between entities. These are very generic definitions that go far beyond what is commonly understood as an event or state.

While it is hard to define necessary criteria for identifying a predicative noun, some useful clues (sufficient criteria) can be used for abstract nouns.

Verb paraphrase: Is the abstract noun derivationally related to a verb with the same semantics?

  • John makes a decision = John decides
    John has a walk = John walks

Adjective paraphrase: Is the abstract noun derivationally related to an adjective with the same semantics?

  • John has courage = John is courageous → and, more generally, characteristics and attributes
    John has hunger/thirst = John is hungry/thirsty → and, more generally, physical sensations
    John has passion/fear/anger = John is passionate/afraid/angry → and, more generally, feelings and emotions
    John has problems/difficulties = Something is problematic/difficult for John → and, more generally, states

Synonym verb or adjective paraphrase: Does the abstract noun have a synonym/hypernym derivationally related to a verb or adjective with the same semantics?

  • John and Mary reach a consensus = John and Mary agree (consensus has no corresponding verb or adjective, but agreement is a synonym)
    John has a chance to do something = John is likely to do something (chance has no corresponding verb or adjective, but likelihood is a synonym)

For many classes of abstract nouns, it can be tricky to apply the tests above. We advise listing in a separate document those classes of nouns that pass test 9 in your language. We suggest considering that the following categories pass test 9:

  • Illnesses, symptoms and health conditions:
    John has a flu = John is ill (illness is a hypernym of flu)
  • Relations:
    John has contact with somebody = John contacts somebody
    John has an affair with somebody = John is involved with somebody (involvement is a synonym of affair)
  • Mental content (internal to a cognizer):
    John has a worry = John worries
    John has an idea = John thinks (thought is a synonym of idea)
    John has an opinion = John believes (belief is a synonym of opinion)

Please notice that events and states that have no semantic arguments do not pass test 9, even if they have verbal/adjectival paraphrases:

  • Natural phenomena: rain, snow, tornado, flood, earthquake
  • Informational content (external to a cognizer): information, news

Finally, note that not every verb + predicative noun combination forms an LVC. In addition, the verb needs to be "light", that is, it must not add semantics to the noun. The remaining LVC tests (tests 10 to 13) guarantee this.

14. How does one decide if a more or less frozen determiner is a lexicalized VMWE component?

Most of the time, it is easy to test whether a determiner is lexicalized by searching for alternatives in corpora (or on the web). For instance, the is lexicalized in to kick the bucket because searches for other determiners (this, a, some, three, many, etc.) either return no results or return only literal uses of this verb phrase.

However, borderline cases do exist, in which alternatives are rare but possible, especially for LVCs and decomposable IDs. For instance, while the standard form of the idiom spill the beans forbids some determiners (#spill three/twenty beans), it is possible to find some variation (spill these/many/all/my/his/more/no beans).

We argue that the selection of some determiners (but not all) by a VMWE is comparable to the selection of prepositions by verbs. Thus, it can be seen as a regular grammatical phenomenon, which suggests that a determiner that varies should not be included. In some VMWEs, though, determiner variation may be considered marginal and/or incorrect, in which case the determiner should be included in the scope of the annotated VMWE.

In short, determiners can exhibit limited variability. As a consequence, each language team should document its decisions as to whether to include them or not for particular VMWE classes, to ensure consistency.

  • avoir la pêche (lit. have the peach) "to have much energy"
    avoir de la chance (lit. have some luck) "to be lucky"
    avoir l'occasion "to have the opportunity"

After annotation, we suggest that language leaders (LLs) use the provided analysis scripts to detect inconsistencies in the annotation of the same VMWE (e.g. including a determiner in some occurrences but not in others). They can then take an arbitrary decision and homogenise all annotated occurrences.
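
To give an idea of what such a consistency check involves, here is a minimal sketch. It is not one of the provided analysis scripts. It assumes that each annotated occurrence is available as a tuple of lemmas and that a list of determiner lemmas is supplied by the language team; the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def find_determiner_inconsistencies(occurrences, determiners):
    """Group occurrences that differ only in the determiners they include.

    occurrences: iterable of tuples of lemmas, one tuple per annotated VMWE.
    determiners: set of determiner lemmas (assumed to be language-specific).
    Returns the groups that contain more than one distinct variant.
    """
    variants = defaultdict(set)
    for lemmas in occurrences:
        key = tuple(lemma for lemma in lemmas if lemma not in determiners)
        variants[key].add(tuple(lemmas))
    return {key: sorted(v) for key, v in variants.items() if len(v) > 1}


# Invented sample annotations, for illustration only.
occurrences = [
    ("avoir", "le", "occasion"),
    ("avoir", "occasion"),
    ("avoir", "le", "pêche"),
    ("avoir", "le", "pêche"),
]
report = find_determiner_inconsistencies(occurrences, {"le", "un", "de"})
for key, found in report.items():
    print(key, "->", found)
```

Each reported group can then be reviewed and harmonised according to the decision documented for that VMWE class.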

15. Should I annotate compound and serial verbs as VMWEs? Of which category?

It depends. Most of the languages currently covered by the shared task do not have this kind of verb. The guidelines were written with these languages in mind, so they are not clear about compound verbs.

In many Indo-European languages (including Germanic, Romance and Balto-Slavic families), verbal chains using auxiliary and modal verbs are used to express tense, modality and aspect. This is a regular linguistic phenomenon that can be applied to any verb and should not be annotated.

On the other hand, some languages like Maltese have many compound verbs that do not necessarily express tense, mood and modality. We suggest that, when a verb regularly combines with any other verb and contributes a given meaning, the combination should not be annotated. Future versions of these guidelines should study the need for a new category for compound verbs, in order to cover this phenomenon.

In short, verbal chains should only be annotated as ID when they are idiomatic:

  • laisser tomber (lit. let fall) "to give up"
    vouloir dire (lit. want say) "to mean"
    faire tomber (lit. make fall) "to drop"
    vouloir changer (lit. want change) "to want to change"
  • dak x'mar jgħid ilbieraħ (lit. that (person) what'he-went he-says yesterday) "what the hell did he say yesterday"
  • querer dizer (lit. want say) "to mean"
    querer falar (lit. want speak) "to want to speak"

16. If an LVC contains a complex (fixed) NP as a dependent, should I include the whole NP or just the head?

The guidelines state that only lexicalized components should be annotated. Therefore, we suggest that, in such cases, if the NP is compositional, only the head of the NP is included in the scope of the LVC. This may lead to the annotation of odd LVCs that never actually occur without a modifier. This is not a problem and is already the case for other VMWEs, e.g. those that only occur with a determiner although the determiner is not lexicalized. The only case where the NP should be included as a whole is when the complement is a non-compositional MWE, so that it would not make any sense to annotate only the head.

  • παίζω το χαρτί του ευρωσκεπτικισμού (lit. to-play the paper the.SG.GEN euroscepticism.SG.GEN) "to use the asset of euroscepticism, to use euroscepticism as an asset"
    κάνω στάση εργασίας (lit. to-make stop work.SG.GEN) "to go on strike, to strike" → the expression στάση εργασίας is non-compositional (term)
  • présenter un Syndrome Coronairien Aigu "to present an acute coronary syndrome"
    mener une vie de débauche "to have a life of pleasures"
    faire un faux pas (lit. make a false step) "to commit a faux pas" → the expression faux pas is non-compositional
  • mieć wyrzuty sumienia (lit. to have reproaches of the conscience) "to feel guilty"
  • fazer uma sessão de fotos/autógrafos "to make a photo/autograph session"
    fazer roleta russa (lit. to make Russian roulette) "to play Russian roulette" → the expression roleta russa is non-compositional
    ter uma situação financeira/profissional/estável "to have a financial/professional/stable situation"
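
The scope rule for complex NPs can be summarised as a two-way choice, sketched below. The function name and arguments are hypothetical; whether an NP is non-compositional remains the annotator's judgement and is passed in as a flag. The NP is assumed to be given without its determiner, since determiners are handled separately.

```python
def lvc_nominal_scope(np_tokens, head_index, non_compositional):
    """Return the NP tokens to include in the LVC annotation.

    np_tokens: tokens of the complex NP, without its determiner.
    head_index: position of the head noun in np_tokens.
    non_compositional: annotator's judgement that the NP is itself an MWE.
    """
    if non_compositional:
        return list(np_tokens)       # keep the NP as a whole
    return [np_tokens[head_index]]   # compositional NP: head only


# fazer uma sessão de fotos: compositional NP, so only the head is kept.
head_only = lvc_nominal_scope(["sessão", "de", "fotos"], 0, non_compositional=False)
# fazer roleta russa: non-compositional NP, so the whole NP is kept.
whole_np = lvc_nominal_scope(["roleta", "russa"], 0, non_compositional=True)
print(head_only, whole_np)
```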

Notice that these suggestions also apply to LVCs whose nominal complements are introduced by prepositions (i.e. verb+PP LVCs). As usual, the preposition should be included if it is lexicalized and then the NP introduced by the preposition is analyzed exactly as described above.

If the complex dependent is an acronym, you may want to add the textual comment "PART" to indicate that only part of the full version is lexicalized (generally, the head), just like for contractions and compounds.