LEMMA attribute

The LEMMA attribute identifies a token by specifying the base form, called the lemma, of a word contained in the knowledge graph.

The syntax is:

LEMMA("string1"[, "string2", ...])

where:

  • LEMMA is the attribute name and must be written in uppercase.
  • string# is any sequence of alphabetic characters, digits and punctuation marks. Each string can consist of one or more words and must be enclosed in quotation marks.

A rule using the LEMMA attribute is valid only if every specified string is found in the knowledge graph; unknown words are not accepted. Strings that are not in the knowledge graph can only be handled with the KEYWORD or PATTERN attributes.

The LEMMA attribute allows the user to specify the base form of a word contained in the knowledge graph and match all of its inflected forms. For nouns, the base form is the singular form (the lemma child matches the text children). For verbs, the base form is the infinitive (go matches went, goes, going). For adverbs and adjectives, the base form also matches comparative and superlative forms (the lemma strong matches stronger and strongest).

Lemma matching is case sensitive: because the entities to be matched are in the knowledge graph, each string must be typed exactly as it appears there. Common nouns are usually written in lowercase, while proper nouns are capitalized. It is always a good idea to check the knowledge graph first to verify how a lemma is written (compare the common noun jaguar, an animal, with the proper noun Jaguar, a luxury car maker).
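The matching behavior described above can be pictured with a small Python sketch. This is only a toy model, not how the engine is implemented: the TOY_FORMS table and the lemma_matches function are hypothetical names, and the table covers only the words used as examples in this page.

```python
# Toy stand-in for the knowledge graph: surface form -> lemma.
# Hypothetical entries covering only the examples mentioned in the text.
TOY_FORMS = {
    "child": "child", "children": "child",
    "go": "go", "goes": "go", "went": "go", "going": "go",
    "strong": "strong", "stronger": "strong", "strongest": "strong",
    "jaguar": "jaguar", "jaguars": "jaguar",  # the animal
    "Jaguar": "Jaguar",                       # the car maker
}


def lemma_matches(token, *lemmas):
    """Return True if the token's lemma equals one of the given strings.

    The comparison is an exact, case-sensitive string comparison.
    """
    base = TOY_FORMS.get(token)
    return base is not None and base in lemmas


print(lemma_matches("went", "go"))           # True
print(lemma_matches("strongest", "strong"))  # True
print(lemma_matches("jaguars", "jaguar"))    # True
print(lemma_matches("jaguars", "Jaguar"))    # False
```

In this sketch a token matches when its base form is one of the listed strings, and jaguar and Jaguar are distinct entries, which mirrors why the string must be typed as it appears in the knowledge graph.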

For example:

LEMMA("coach")

Since coach is the base form for both the noun coach and the verb to coach, the attribute above matches any of the following tokens: coach, coaches, coached, coaching, and so on.

LEMMA("dog", "sheep dog", "Bordeaux Mastiff")

The attribute above lists three strings: dog and sheep dog are contained in the knowledge graph, but Bordeaux Mastiff is not, so the rule would generate an error. Note that sheep dog is a lemma composed of two or more words that the knowledge graph contains as a single element. This is called a collocation: a sequence of words that often co-occur in a language and have become a fixed expression through repeated use. Other collocations present in the knowledge graph are credit card, boarding school and public finance.
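As a rough illustration of why the three-string example is rejected, the following Python sketch checks each string against a toy set of known lemmas. The set TOY_KNOWN_LEMMAS and the function check_lemma_arguments are hypothetical; the real validation and its error message may look different.

```python
# Hypothetical set of lemmas known to a toy knowledge graph.
# Collocations are stored as single multi-word entries.
TOY_KNOWN_LEMMAS = {
    "dog", "sheep dog", "credit card", "boarding school", "public finance",
}


def check_lemma_arguments(*strings):
    """Raise ValueError if any string is not a known lemma."""
    unknown = [s for s in strings if s not in TOY_KNOWN_LEMMAS]
    if unknown:
        raise ValueError("not in the knowledge graph: " + ", ".join(unknown))


check_lemma_arguments("dog", "sheep dog")  # passes silently

try:
    check_lemma_arguments("dog", "sheep dog", "Bordeaux Mastiff")
except ValueError as err:
    print(err)  # not in the knowledge graph: Bordeaux Mastiff
```

A single unknown string is enough to reject the whole list in this sketch, which is the behavior the example describes.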

LEMMA("State", "Republic of Trinidad and Tobago")

The above is an example of noun lemmas written with an initial capital letter. State is often capitalized when it refers to a nation, while Republic of Trinidad and Tobago is the proper noun for a country and is always capitalized. Since lemma matching is case sensitive, these strings must be written exactly as they are found in the knowledge graph.