corpora of multiword expressions - version 1.2 (2020)
shared task on semi-supervised identification of verbal multiword expressions - edition 1.2 (2020)
Frequently Asked Questions (FAQ)
Annotators often face questions and challenging examples. When several annotators ask the same question, we will update the list of frequently asked questions.
However, we suggest that language teams set up another communication platform to deal with questions that are specific to a language. This can take the form of a shared online document, a wiki, a dedicated bug tracking system or a mailing list. We also suggest keeping track of decisions taken concerning borderline examples (with a list of expressions to which each decision applies). These should be kept in a centralized document or page that all annotators can access.
Whenever you think that a question can also be interesting to other languages, please notify the organizers and we will try to update this page.
- How to define an unexpected change in meaning?
- How to annotate lexicalized words which belong to contractions, compounds, and acronyms?
- How to annotate coordinated VMWEs sharing some components?
- How to annotate elliptical occurrences of VMWEs?
- How to annotate VMWEs that seem to belong to more than one category?
- How to annotate embedded VMWEs?
- Are existential expressions with there is/are considered VMWEs?
- How to categorize VMWEs which seem LVCs but do not pass all LVC tests?
- Why are verb+noun constructions with pure operator verbs (to commit, to make, to have etc.) considered LVCs?
- Does the IRV category include verbs with non-reflexive clitics?
- Should nominalizations of VMWEs be annotated?
- How to express hesitation between different VMWE categories?
- How can one decide what are the semantic arguments of a noun for borderline cases?
- How does one decide if a more or less frozen determiner is a lexicalized VMWE component?
- Should I annotate compound and serial verbs as VMWEs? Of which category?
- If an LVC contains a complex (fixed) NP as a dependent, should I include the whole NP or just the head?
- In an LVC candidate, if the verb adds aspect to the predicative noun, does it imply failing Test LVC.3?
- In the LVC decision tree, should I test that the noun keeps its original meaning?
Check the glossary entry that defines unexpected change in meaning
In some languages adpositions (pre- or post-positions), clitics and determiners are subject to contractions (i.e. they yield multiword tokens, MWTs). If they are properly split by the tokenizer, only the lexicalized parts of each contraction should be annotated. If you use FLAT for annotating, split contractions are displayed twice: in their folded and in their unfolded version. Only the latter should be annotated, e.g. Jean bénéficie du de le traitement Jean benefits from the treatment, Jean donne du de le grain à moudre à son fils Jean gives grain to grind to his son Jean gives an occasion to act to his son.
Sometimes, however, tokenizers might not handle contraction splitting properly. In this case, a lexicalized component of a VMWE can be merged with an external word:
- haberse suicidado have+REFL suicided committed suicide
- aller au (à+le) secours go to+the rescue to rescue
A similar problem occurs in languages with productive compounding, where a lexicalized component of a VMWE and a free modifier can build up a multitoken word (since compound splitting might not be a standard feature of a tokenizer):
unter Drogeneinfluss stehen to be under the influence of drugs
Heisshunger haben to have hot hunger to be ravenously hungry
Yet another related phenomenon concerns acronyms whose spelled-out versions may contain predicative nouns which in the abbreviated versions boil down to single letters:
the patient has AIDS (acquired immunodeficiency syndrome)
the book underwent OCR (optical character recognition)
the program carries out a PCA (principal component analysis)
- el paciente presenta un SCA (síndrome coronario agudo)
le patient présente un SCA (Syndrome coronarien aigu)
le patient fait un AVC (accident vasculaire cérébral)
Since the current annotation format is token-based, we prohibit correcting tokenization errors and compound splitting by the annotators for the sake of coherence. Therefore the annotation of such contractions, compounds and acronyms finds no fully satisfactory solution in our schema. We propose to annotate a whole MWT each time it contains a word which is part of a VMWE. Annotators should add a textual comment about the mixed status of this MWT:
Drogeneinfluss → MWT containing a lexicalized VMWE component Einfluss and an external word Drogen
Heisshunger → MWT containing a lexicalized VMWE component Hunger and an additional modifier heiss
- haberse → MWT containing a lexicalized VMWE component se and an external word haber
A component shared by two or more coordinated VMWEs should be annotated as belonging to both of them.
- Regeln und Richtlinien aufstellen to set up rules and guidelines to draw up rules and guidelines → aufstellen must be annotated both as part of Regeln aufstellen to lay down rules and of Richtlinien aufstellen to draw up guidelines
- κάναμε βόλτες και ένα σωρό ψώνια στο εμπορικό κέντρο → κάναμε we made must be annotated both as part of κάναμε βόλτες we made walks and of κάναμε ψώνια we were buying
- to have a walk or a ride → have must be annotated both as part of to have a walk and of to have a ride
- darse un baño o una ducha give a bath or a shower to have a bath or a shower → darse must be annotated both as part of darse un baño and of darse una ducha
- hitz eta lan egin word and work do to speak and work → egin must be annotated as part of both hitz egin and lan egin.
- odprawić mszę i pokutę celebrate a mass and a penance → odprawić should be annotated both as part of odprawić mszę to celebrate a mass and of odprawić pokutę to celebrate a penance
- a cere cuiva explicații sau socoteală to ask someone.to explanations or account → cere should be annotated both as part of cere explicații and cere socoteală
- imeti dober želodec in dobre živce to have a good stomach to bear something well and good nerves to be mentally strong → imeti have must be annotated both as part of imeti dober želodec and of imeti dobre živce
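One way to picture a shared component in a token-based annotation is to store each VMWE as a set of token positions and let the sets overlap. The sketch below is illustrative only: the `Vmwe` class and the `memberships` helper are hypothetical names chosen for this example and are not part of any annotation tool or file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vmwe:
    """One annotated VMWE, identified by the positions of its lexicalized tokens."""
    vmwe_id: int
    token_positions: frozenset


def memberships(tokens, vmwes):
    """Map each token position to the sorted ids of the VMWEs it belongs to."""
    result = {i: [] for i in range(len(tokens))}
    for vmwe in vmwes:
        for pos in vmwe.token_positions:
            result[pos].append(vmwe.vmwe_id)
    return {pos: sorted(ids) for pos, ids in result.items()}


# "to have a walk or a ride": the verb is shared by both coordinated VMWEs.
tokens = ["have", "a", "walk", "or", "a", "ride"]
have_a_walk = Vmwe(1, frozenset({0, 2}))  # have ... walk
have_a_ride = Vmwe(2, frozenset({0, 5}))  # have ... ride

shared = memberships(tokens, [have_a_walk, have_a_ride])
print(shared[0])  # [1, 2]
```

The shared verb ends up with two VMWE ids, while each noun keeps a single id and the non-lexicalized tokens keep none.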
Such hesitation issues should normally be solved by the structural tests. For instance, consider the German expression sich eine Frage stellen SELF a question put to doubt. It may seem to belong to both IRV, since sich is required only if stellen co-occurs with Frage, and LVC, since Frage keeps its original meaning and stellen brings no additional meaning. However, test S.2 [1DEP] indicates that an expression like this should be annotated as a VID, since the verb has more than one lexicalized syntactic dependent.
Similarly, the French expression avoir peur have fear to be afraid seems to have features of a VID. Unlike most LVCs, it does not allow a determiner (*avoir une peur have a fear), except when the noun is modified (avoir une grande peur have a great fear). However, test S.4 [CATEG] in the generic decision tree 2, and the LVC-specific decision tree indicate that it belongs to the LVC category.
Candidate VMWEs embedded in other VMWEs should be annotated only if they have a VMWE status also outside the particular context. For instance, the VMWE to let the cat out of the bag should be annotated as a VID, and its embedded VMWE to let out as a VPC.
On the other hand, in the French expression se faire des idées SELF make DET.PL ideas to imagine things which are not true, se faire should not be annotated as an IRV, since it is not inherently reflexive as a standalone verb+clitic combination.
Hesitations about a possible LVC status can arise with respect to existential constructions with nouns introducing events or properties (see test LVC.1 [N-PRED]) as in:
- es gibt Beschwerden there are complaints
- υπάρχουν κατηγορίες there-are problems there are problems
- there are complaints
- hay quejas there are complaints
- arazoak daude problems there-are there are problems
- il existe des plaintes it exists complaints there are complaints
- há queixas has complaints there are complaints
Namely, the noun keeps its original sense and the existential verb to be or to have brings no additional meaning. However, a candidate LVC must also pass test LVC.4 [V-REDUC]. This requires the modification of the noun by the verb's subject, which is impossible with impersonal and empty subjects like there. Therefore, such candidates cannot be LVCs.
Note, however, that existential expressions themselves can be VMWEs of the VID type. For instance, in the French example il y a des plaintes it there has complaints there are complaints, two dependents of the verb a has are lexicalized: il it and y there. Therefore it is a VID (see test S.2 [1DEP]).
If at least one of the five LVC tests (LVC.0 to LVC.4) is not passed, the candidate is not considered an LVC. For the sake of a deterministic VMWE categorization and higher inter-annotator agreement, we adopt a definition of an LVC which might seem more restrictive than the one usually assumed in linguistic studies. Thus, we exclude from the LVC scope:
- expressions in which the verb's syntactic subject is not necessarily the noun's semantic subject, like to give courage or to make an impression. These candidates do not pass test LVC.4 [V-REDUC].
- expressions where the lexicalized nominal dependent of the verb is its subject, as in the problem lies in something; these candidates do not pass test LVC.4 [V-REDUC].
- expressions with aspectual verbs, as in to start, to pursue, to stop a walk. These do not pass test LVC.3 [V-LIGHT] since they add (aspectual) semantics to the noun. The only exception is when the noun itself is already aspectual, as in to come into bloom.
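The "all tests must pass" rule can be sketched as a simple conjunction. This is a minimal illustration, not the decision tree itself: the outcomes are judgements supplied by the annotator, the function names are hypothetical, and the real decision tree applies the tests in order rather than all at once.

```python
LVC_TESTS = ("LVC.0", "LVC.1", "LVC.2", "LVC.3", "LVC.4")


def is_lvc(outcomes):
    """Return True only if every LVC test is recorded as passed.

    `outcomes` maps a test name to the annotator's judgement (True = passed).
    A test that is missing from the mapping counts as not passed.
    """
    return all(outcomes.get(test, False) for test in LVC_TESTS)


def failed_tests(outcomes):
    """List the tests that block the LVC label, in test order."""
    return [test for test in LVC_TESTS if not outcomes.get(test, False)]


# "to give courage": the verb's subject is not necessarily the noun's subject,
# so it is ruled out by LVC.4 [V-REDUC]. The other outcomes are assumed here
# purely for the sake of the example.
give_courage = {
    "LVC.0": True,
    "LVC.1": True,
    "LVC.2": True,
    "LVC.3": True,
    "LVC.4": False,
}
print(is_lvc(give_courage))        # False
print(failed_tests(give_courage))  # ['LVC.4']
```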
Pure operator verbs, i.e. verbs which never have any semantics per se but only carry grammatical information (tense, mood etc.), seem to contradict the intuition behind a VMWE. Namely, they usually select a whole semantic class of nouns. For instance to commit selects any negative act (a crime, a suicide, a theft) and to perform selects any activity (a task, an experiment, a miracle). In this sense, their complements resemble open slots and the whole combinations resemble collocations. However, for the sake of a deterministic VMWE categorization and higher inter-annotator agreement, we do include verb+noun combinations with pure operator verbs, such as to commit a crime and to perform a task, in the LVC category. This is because such combinations pass all tests (LVC.0 through LVC.4). We found no other reliable tests which would distinguish such productive cases from less productive ones like to make a decision. In particular, some studies (e.g. Bonial 2014) show that there exist no truly productive light verbs. Therefore, all examples cited here are to be classified as LVCs.
No, the IRV category only includes (some) combinations of a head verb with a reflexive clitic. As indicated in the borderline cases page of the IRV category, other pronouns, whenever lexicalized, trigger the VID category. Recall that whenever more than one dependent of the verb is lexicalized (whether or not a reflexive clitic is among them), the VMWE is always categorized as a VID:
- sich Fragen stellen SELF questions put to doubt
- s'en aller SELF of-there go to leave
- ucvreti jo to escape her to escape something/someone by running
The only nominal VMWE variants within our annotation scope are those:
- headed by the gerund stemming from the head verb of the VMWE - taking of the decision, and
- in which a noun stemming from a VMWE is modified by a participle or a relative clause headed by the verb stemming from the same VMWE - the decisions taken yesterday, the decision which he took.
Other nominalizations are excluded:
- Wortbruch word-break a promise which has not been kept
- a break-down, a forget-me-not
toma de decisiones taking of decisions decision making
puesta a punto setting to point set-up
- izen-emate, esker-egite name-giving, thanks-doing inscription, thanks-giving
- la prise en compte the taking into account the fact of taking something into account, peut-être may-be maybe, porte-feuilles carry-sheets wallet
- zabawa czyimś kosztem a play at someone else's expense derived from bawić się czyimś kosztem to enjoy oneself at someone else's expense
- un pierde-vară a loses-summer a lazy person
- šala na tuj račun a joke at someone else's expense derived from šaliti se na tuj račun to play a joke on someone
For practical reasons (e.g. compatibility with an existing annotation, or usefulness for a particular application), such nominalizations can be considered language-specific VMWEs, but then a new category should be defined for them, so as to keep the universal and the quasi-universal categories intact.
Once identified in a text, each VMWE is to be assigned to exactly one category. Note that in this version of the guidelines we no longer admit "hesitation labels" (e.g. LVC/VID) used in the pilot annotation. Hesitation can, however, be expressed in a comment and a particular value of the annotator's confidence assigned to a particular VMWE occurrence.
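The "one category, with hesitation recorded separately" rule can be pictured as a small record type. The sketch below is only an assumption about how a team might store this: the class name, the field names and the 0 to 1 confidence scale are hypothetical, and the category set lists only the labels mentioned in this FAQ, not necessarily the full inventory of the guidelines.

```python
from dataclasses import dataclass

# Category labels mentioned in this FAQ; the full guidelines may define more.
CATEGORIES = {"LVC", "VID", "IRV", "VPC", "MVC"}


@dataclass
class VmweAnnotation:
    tokens: tuple
    category: str
    confidence: float = 1.0  # hypothetical scale: 0 (unsure) to 1 (certain)
    comment: str = ""

    def __post_init__(self):
        # Combined "hesitation labels" such as "LVC/VID" are rejected here.
        if self.category not in CATEGORIES:
            raise ValueError(f"expected exactly one category, got {self.category!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


# Hesitation goes into the comment and the confidence, not into the label.
ann = VmweAnnotation(
    tokens=("avoir", "peur"),
    category="LVC",
    confidence=0.6,
    comment="hesitated between LVC and VID",
)
```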
The goal of test LVC.1 is to identify whether a noun is predicative, that is, whether it requires at least one semantic argument. For many classes of abstract nouns, however, it can be tricky to apply the test. We advise listing in a separate document those classes of nouns that pass test LVC.1 in your language. Language teams can also provide links to the documentation of semantic annotation projects such as NomBank for English, which usually include tests and descriptions that help identify semantic arguments.
We suggest considering that the following categories pass test LVC.1:
Illnesses, symptoms and health conditions:
Ο Γιάννης έχει συνάχι = ο Γιάννης είναι άρρωστος (αρρώστεια is a hypernym of συνάχι)
Ο Γιάννης έχει σχέση με κάποιον = Ο Γιάννης σχετίζεται με κάποιον
Ο Γιάννης έχει επαφές με κάποιον = Ο Γιάννης επικοινωνεί με κάποιον (επικοινωνία is a synonym of επαφή)
Mental content (internal to a cognizer):
Ο Γιάννης έχει ανησυχία = Ο Γιάννης ανησυχεί
Ο Γιάννης έχει μια ιδέα = Ο Γιάννης σκέφτεται (σκέψη is a synonym of ιδέα)
Ο Γιάννης έχει την άποψη = Ο Γιάννης κρίνει (κρίση is a synonym of άποψη)
Illnesses, symptoms and health conditions:
John has a flu = John is ill (illness is a hypernym of flu)
John has contact with somebody = John contacts somebody
John has an affair with somebody = John is involved with somebody (involvement is a synonym of affair)
Mental content (internal to a cognizer):
John has a worry = John worries
John has an idea = John thinks (thought is a synonym of idea)
John has an opinion = John believes (belief is a synonym of opinion)
Mental content (internal to a cognizer):
Miha je v dvomih Miha is in doubts = Miha dvomi Miha doubts
Miha je mnenja Miha is of opinion = Miha meni Miha believes
Miha ima predstavo/pojma Miha has an idea = Miha meni Miha thinks (predstava, pojem are synonyms of idea in this context)
Please notice that events and states that have no semantic arguments do not pass test LVC.1, even if they have verbal/adjectival paraphrases:
Natural phenomena: rain, snow, tornado, flood, earthquake
Informational content (external to a cognizer): information, news
Natural phenomena: dež, sneg, tornado, poplava, potres rain, snow, tornado, flood, earthquake
Informational content (external to a cognizer): informacije, novice information, news
Finally, notice that not every verb + predicative noun combination forms an LVC. Additionally, the verb needs to be "light", i.e. it must not add semantics to the noun. The remaining LVC tests guarantee this.
Most of the time, it is easy to test whether a determiner is lexicalized by searching alternatives in corpora (or on the web). For instance, the is lexicalized in to kick the bucket because searches for other determiners (this, a, some, three, many, etc.) either do not return any result or return only literal uses of this verb phrase.
However, borderline cases do exist, in which alternatives are rare but possible, especially for LVCs and decomposable VIDs. For instance, while the standard form of the idiom spill the beans forbids some determiners (#spill three/twenty beans), it is possible to find some variation (spill these/many/all/my/his/more/no beans).
We argue that the selection of some determiners (but not all) by a VMWE is comparable to selected prepositions for verbs. Thus, it can be seen as a regular grammatical phenomenon, suggesting that when the determiner varies, then it should not be included in the annotation scope. Possessive pronouns (my, her, their, etc.) and reflexive clitics (myself, herself, themselves, etc.) are exceptions to this rule (see also Section 1.4). Namely, when they are constrained to agree in number and person with the subject (I do my best, *I do your best), they are realized by different lexemes, i.e., strictly speaking, they are not lexicalized. We consider, however, that - with respect to lexicalization - they constitute single lexemes inflecting for number and gender.
Particular language teams may of course adopt their own criteria for annotating partly frozen determiners. Then, these decisions should be documented in language-specific guidelines.
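The corpus-based check for determiners can be sketched as follows. This is a rough illustration under stated assumptions: the input is a hand-collected list of determiners attested in idiomatic (non-literal) uses of the expression, the function names and the `<POSS>` placeholder are hypothetical, and possessives are collapsed into one lexeme only because they are assumed to agree with the subject.

```python
POSSESSIVES = {"my", "your", "his", "her", "its", "our", "their"}


def normalize(determiner):
    """Collapse agreeing possessives into one placeholder lexeme."""
    det = determiner.lower()
    return "<POSS>" if det in POSSESSIVES else det


def determiner_is_lexicalized(attested):
    """True if the idiomatic uses show exactly one determiner lexeme."""
    return len({normalize(d) for d in attested}) == 1


# kick the bucket: only "the" is attested in idiomatic uses.
print(determiner_is_lexicalized(["the", "the", "the"]))          # True
# spill the beans: the determiner varies, so it stays out of the scope.
print(determiner_is_lexicalized(["the", "these", "all", "no"]))  # False
# do one's best: agreeing possessives count as a single lexeme.
print(determiner_is_lexicalized(["my", "his", "their"]))         # True
```

Rare or borderline variants still call for a human decision; a count like this only summarizes what the search returned.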
It depends. In many Indo-European languages (including Germanic, Romance and Balto-Slavic families), verbal chains using auxiliary and modal verbs are used to express tense, mood, modality and aspect. This is a regular linguistic phenomenon, fully productive, that can be applied to any verb and should not be annotated at all.
On the other hand, some languages have idiomatic compound and serial verbs, that is, VMWEs whose lexicalized components are two verbs, neither of which expresses tense, mood, modality and/or aspect with respect to the other one. Therefore, we created a new category in edition 1.1 to annotate these constructions, called multi-verb construction (MVC), covering examples such as:
- will sagen want to say that is to say
to let go
to make do
- querer decir want say to mean
laisser tomber let fall to give up
vouloir dire want say to mean
lasciar andare let go to unhand
voler dire want say to mean
dać komuś żyć to let someone live not to bother someone
można wytrzymać one can stand the situation is reasonably good
querer dizer want say to mean
ouvir falar hear speak to know/remember vaguely
The guidelines determine that only lexicalized components should be annotated. Therefore, we suggest that, in such cases, if the NP is compositional, only the head of the NP is included in the scope of the LVC. This may lead to the annotation of odd LVCs that actually never occur by themselves without a modifier. This is not a problem and is already the case for other VMWEs, e.g. the ones that only occur with a determiner, but the determiner is not lexicalized. The only cases where the NP should be included as a whole is if the complement is a non-compositional MWE, so that it would not make any sense to annotate only the head.
παίζω το χαρτί του ευρωσκεπτικισμού to-play the paper the.SG.GEN euroscepticism.SG.GEN to use the asset of euroscepticism, to use euroscepticism as an asset
κάνω στάση εργασίας to-make stop work.SG.GEN to go on strike, to strike → the expression στάση εργασίας is non-compositional (term)
- darse una larga ducha caliente give.self a long shower hot to have a long and hot shower
présenter un Syndrome Coronarien Aigu to present an acute coronary syndrome
mener une vie de débauche to have a life of pleasures
faire un faux pas make a false step to commit a faux pas → the expression faux pas is non-compositional
- mieć wyrzuty sumienia to have reproaches of the conscience to feel guilty
fazer uma sessão de fotos/autógrafos to make a photo/autograph session
fazer roleta russa to make russian roulette to play russian roulette → the expression roleta russa is non-compositional
ter uma situação financeira/profissional/estável to have a financial/professional/stable situation
Notice that these suggestions also apply to LVCs whose nominal complements are introduced by prepositions (i.e. verb+PP LVCs). As usual, the preposition should be included if it is lexicalized and then the NP introduced by the preposition is analyzed exactly as described above.
If the complex dependent is an acronym, you may want to add the textual comment "PART" to indicate that only part of the full version is lexicalized (generally, the head), just like for contractions and compounds.
Depending on the language, aspect can be realised by various lexical, morphological and syntactic means.
- We consider aspect a morphological feature in the following cases:
- Perfective or continuous aspect introduced by inflection and/or analytical tenses:
John was making a presentation
he called her while having a walk
- Perfective or imperfective aspect inherent to the verb (independently of its inflected form), recognisable either by a prefix or by an ending:
pełnić rolę fulfil.IMPERF a role to play a role
wypełnić rolę fulfil.PERF a role to play a role
wypełniać rolę fulfil.IMPERF a role to play a role
Taja je postavljala vprašanja Taja was asking questions
ves čas je dajal napačne napovedi he was always giving wrong forecasts
- We consider aspect a semantic feature in the following cases:
- Starting, continuation or completion is expressed by precise verbs which usually modify other verbs:
η Μαρία άρχισε τη συζήτηση Maria started the conversation
ο Γιάννης διέκοψε την κουβέντα John interrupted the discussion
Anthony started his presentation in advance
the weather interrupted the transmission twice
we kept our show regardless of the reactions
Tomaž je začel svoje predavanje Tomaž started his lecture
Politik je nadaljeval svojo napoved reform the politician continued his forecast about reforms
naredili bomo konec onesnaževanju we will make end to pollution we will put an end to pollution
In Test LVC.3, we verify whether the verb is "light", i.e. whether it adds no semantics of its own to the predicative noun. When aspect is expressed as a morphological feature, as in the first group of cases above, we consider that the verb is light and test LVC.3 passes. However, when aspect is a semantic feature rather than a morphological one, test LVC.3 fails and the candidate is not an LVC.
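The distinction above can be written down as a small lookup. This is a minimal sketch: the enum and the function name are hypothetical, the classification of each example is the annotator's judgement, and the function covers aspect only (a verb may still fail LVC.3 for other reasons).

```python
from enum import Enum


class AspectSource(Enum):
    NONE = "none"                    # the verb contributes no aspect at all
    MORPHOLOGICAL = "morphological"  # inflection, analytical tense, prefix or ending
    SEMANTIC = "semantic"            # start/continue/stop-type verbs


def passes_lvc3_wrt_aspect(source):
    """Aspect blocks test LVC.3 only when it is a semantic feature of the verb."""
    return source is not AspectSource.SEMANTIC


# "John was making a presentation": continuous aspect comes from the tense form.
print(passes_lvc3_wrt_aspect(AspectSource.MORPHOLOGICAL))  # True
# "Anthony started his presentation": the verb itself means "begin".
print(passes_lvc3_wrt_aspect(AspectSource.SEMANTIC))       # False
```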
The previous version (1.0) of the annotation guidelines contained Test 10 [N-SEM], which checked if the noun in an LVC candidate preserves one of its original senses. If it did not, the candidate was not an LVC.
In the current version of the guidelines we have abandoned this test because:
- it proved hard to establish the list of original senses of a noun,
- this test was superfluous with respect to Test LVC.4 [V-REDUC],
- in some verbal idioms (VIDs) the noun also keeps its original sense, so the test can be misleading for the LVC vs. VID distinction.