
Lemmatization

Lemmatization is the deep linguistic analysis process that tags tokens and atoms with their corresponding lemmas.

For example, for this sentence:

Michael Jordan was one of the best basketball players of all time.

lemmatization produces this output for detected tokens:

| Token | Lemma |
|---|---|
| Michael Jordan | Michael Jordan |
| was | (to) be |
| one | one |
| of | of |
| the | the |
| best | good |
| basketball players | basketball player |
| of | of |
| all | all |
| time | time |
| . | n/a |

In the case of collocations, lemmatization is also applied to constituent atoms.
For example, the token:

basketball players

is a collocation composed of two atoms, for which lemmas are:

``` text
basketball
player
```
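
The relationship between a collocation and its atoms can be pictured with a small data model. This is only an illustrative sketch: the `Token` and `Atom` classes and their field names are hypothetical and are not taken from the analysis output itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Atom:
    """Hypothetical model of one constituent of a collocation."""
    text: str
    lemma: str


@dataclass
class Token:
    """Hypothetical model of a token; a collocation carries several atoms."""
    text: str
    lemma: str
    atoms: List[Atom] = field(default_factory=list)

    def is_collocation(self) -> bool:
        # In this sketch, a token counts as a collocation if it has 2+ atoms.
        return len(self.atoms) > 1


# The collocation from the example above, with one lemma per atom.
token = Token(
    text="basketball players",
    lemma="basketball player",
    atoms=[Atom("basketball", "basketball"), Atom("players", "player")],
)

atom_lemmas = [atom.lemma for atom in token.atoms]
print(token.lemma)   # lemma of the whole collocation
print(atom_lemmas)   # lemmas of the constituent atoms
```

The token keeps its own lemma for the collocation as a whole, while each atom keeps a separate lemma, which mirrors the two levels described above.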

In the case of anaphoras, the lemma is that of the antecedent or postcedent.
For example, in the following text:

``` text
Michael Jordan was one of the best basketball players of all time. Scoring was Jordan's stand-out skill, but he still holds a defensive NBA record, with eight steals in a half.
```

lemmatization recognizes "Jordan" and "he" in the second sentence as anaphoras. Both have "Michael Jordan" in the first sentence as their antecedent, so the lemma returned for each anaphora is Michael Jordan.
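
The rule for anaphoras can be sketched as a lookup that defers to the referent when one exists. The `Mention` class, its `own_lemma` and `referent` fields, and the way the link is stored are all assumptions made for illustration; how the analysis actually detects and links anaphoras is not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Mention:
    """Hypothetical model of a token that may refer to another token."""
    text: str
    own_lemma: str
    # Antecedent or postcedent, if this mention is an anaphora (assumed field).
    referent: Optional["Mention"] = None

    @property
    def lemma(self) -> str:
        # An anaphora takes the lemma of the token it refers to;
        # any other token keeps its own lemma.
        if self.referent is not None:
            return self.referent.lemma
        return self.own_lemma


antecedent = Mention("Michael Jordan", "Michael Jordan")
jordan = Mention("Jordan", "Jordan", referent=antecedent)
he = Mention("he", "he", referent=antecedent)

anaphora_lemmas = [m.lemma for m in (jordan, he)]
print(anaphora_lemmas)
```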

Lemmatization output is part of the JSON object returned by deep linguistic analysis.
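
As a rough illustration of reading lemmas out of such a JSON object, the sketch below parses a hand-written sample. The field names (`tokens`, `text`, `lemma`, `atoms`) and the overall shape are assumptions chosen for readability; the real response schema may differ, so check it against the actual output before relying on this.

```python
import json

# Hand-written sample with an ASSUMED shape, not a real analysis response.
response_text = """
{
  "tokens": [
    {"text": "best", "lemma": "good", "atoms": []},
    {"text": "basketball players", "lemma": "basketball player",
     "atoms": [
       {"text": "basketball", "lemma": "basketball"},
       {"text": "players", "lemma": "player"}
     ]},
    {"text": "time", "lemma": "time", "atoms": []}
  ]
}
"""


def extract_lemmas(payload):
    """Return (token text, token lemma, atom lemmas) for each token."""
    rows = []
    for token in payload.get("tokens", []):
        atom_lemmas = [atom["lemma"] for atom in token.get("atoms", [])]
        rows.append((token["text"], token["lemma"], atom_lemmas))
    return rows


rows = extract_lemmas(json.loads(response_text))
for text, lemma, atom_lemmas in rows:
    print(f"{text!r} -> {lemma!r} (atoms: {atom_lemmas})")
```

Using `.get` with a default for `tokens` and `atoms` keeps the sketch tolerant of tokens that carry no atoms, which is the common case for single-word tokens.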