Annotation guidelines (version 2.0)
Used by the corpora annotated for multiword expressions
Identifying multiword tokens
The relation between words and tokens is not always one-to-one. If a single token contains more than one word, it is a potential MWE. For the purpose of MWE annotation it is therefore important to define a word as clearly as possible. This section contains language-specific tests for identifying multiword tokens (MWTs). Currently the tests cover only Swedish.
Swedish-specific tests for identifying MWTs
Test MWT.SV.1 - [NNC-MWT] - Noun+noun compound
Is the candidate a noun+noun compound, i.e. does it function as a noun, and consist of two (or more) components that are all nouns (and that can function as stand-alone nouns)? Note that modifier nouns may occur in a compounding form.
- YES: it is an MWT
- NO: go to the next test
Example: skol|boks|hylla (lit. school book shelf) 'school bookshelf'
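The check in MWT.SV.1 can be sketched in code. This is only an illustration under stated assumptions, not part of the guidelines: the lexicon below is a hypothetical toy, the candidate is assumed to be already segmented with `|` as in the example, the modifiers are assumed to be all components except the last, and whether the candidate functions as a noun is a judgment supplied by the annotator.

```python
# Hypothetical toy lexicon (an assumption, not part of the guidelines).
NOUNS = {"skola", "bok", "hylla"}                    # stand-alone nouns
COMPOUNDING_FORMS = {"skol": "skola", "boks": "bok"}  # compounding form -> noun


def base_noun(component, is_modifier):
    """Return the stand-alone noun behind a component, or None if there is none."""
    if component in NOUNS:
        return component
    if is_modifier:
        # Only modifier nouns may occur in a compounding form.
        return COMPOUNDING_FORMS.get(component)
    return None


def is_noun_noun_compound(segmented, functions_as_noun=True):
    """Sketch of MWT.SV.1 for a candidate segmented with '|'."""
    parts = segmented.split("|")
    if not functions_as_noun or len(parts) < 2:
        return False
    last = len(parts) - 1
    # Assumption: every component except the last one is a modifier.
    return all(base_noun(p, i < last) is not None for i, p in enumerate(parts))


print(is_noun_noun_compound("skol|boks|hylla"))  # True
print(is_noun_noun_compound("av|bryta"))         # False
```

In practice the lexicon lookup would be replaced by the annotator's own judgment or by a morphological analyser; the sketch only makes the structure of the test explicit.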
Test MWT.SV.2 - [SPLIT-MWT] - Splittable MWT
Split the candidate token into its component parts. Can it be used as an expression in the split form (possibly with slightly shifted semantics)? In some cases, a direct split is not possible. In such cases, it is permissible to change the word order and to insert function words, but not any additional content words.
- YES: it is an MWT
- NO: go to the next test
Example: avbryta (lit. off-break) 'cancel', bryta av (lit. break off) 'break off'
Test MWT.SV.3 - [DEVERBAL_SPLIT-MWT] - Splittable as a deverbal expression
For candidates that can potentially have a deverbal form (nominal, adjectival and adverbial expressions), is the deverbal form either split, or splittable (according to the definition in MWT.SV.2)?
- NO: it is not an MWT. Note that the answer might be no in two cases:
  - The current expression cannot be deverbal
  - The deverbalization either does not result in an existing expression, or it cannot be split with the semantics kept the same (or with a slight shift)
- YES: it is an MWT, but only the decision rules for deverbal expressions should be applied
Example: allaredan (lit. all ready) 'already'. Here allaredan is an adverb that cannot be verbalized.
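The way the three Swedish tests chain together can be summarized as a small decision function. This is a minimal sketch, assuming the annotator has already answered each test with yes or no; the names `Outcome` and `classify_candidate` are hypothetical and introduced only for illustration.

```python
from enum import Enum


class Outcome(Enum):
    MWT = "MWT"
    MWT_DEVERBAL_RULES_ONLY = "MWT (apply only the decision rules for deverbal expressions)"
    NOT_MWT = "not an MWT"


def classify_candidate(is_nn_compound, is_splittable, is_deverbal_splittable):
    """Chain the annotator's yes/no answers to MWT.SV.1, MWT.SV.2 and MWT.SV.3."""
    if is_nn_compound:            # MWT.SV.1: noun+noun compound
        return Outcome.MWT
    if is_splittable:             # MWT.SV.2: usable as an expression in split form
        return Outcome.MWT
    if is_deverbal_splittable:    # MWT.SV.3: deverbal form is split or splittable
        return Outcome.MWT_DEVERBAL_RULES_ONLY
    return Outcome.NOT_MWT


# Answers taken from the examples in this section.
print(classify_candidate(True, False, False).value)   # skolbokshylla: MWT
print(classify_candidate(False, True, False).value)   # avbryta: MWT
print(classify_candidate(False, False, False).value)  # allaredan: not an MWT
```

Because each test is consulted only when the previous one was answered no, the order of the `if` statements mirrors the order of the tests.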