Lemmatization
Lemmatization is the deep linguistic analysis process that tags tokens and atoms with their corresponding lemmas.
For example, for this sentence:
Michael Jordan was one of the best basketball players of all time.
lemmatization produces this output for detected tokens:
Token | Lemma |
---|---|
Michael Jordan | Michael Jordan |
was | (to) be |
one | one |
of | of |
the | the |
best | good |
basketball players | basketball player |
of | of |
all | all |
time | time |
. | n/a |
In the case of collocations, lemmatization is also applied to constituent atoms.
For example, the token:
basketball players
is a collocation composed of two atoms, for which lemmas are:
basketball
player
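The relationship between a collocation token and its atoms can be sketched as a small data structure. This is only an illustration: the `Token` and `Atom` classes, their field names, and the `atom_lemmas` helper are hypothetical and are not taken from the analyzer's actual output format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Atom:
    text: str   # surface form of the atom
    lemma: str  # lemma assigned to the atom


@dataclass
class Token:
    text: str   # surface form of the whole token
    lemma: str  # lemma assigned to the whole token
    atoms: List[Atom] = field(default_factory=list)


# Hand-written illustration of the collocation above, not real analyzer output.
token = Token(
    text="basketball players",
    lemma="basketball player",
    atoms=[Atom("basketball", "basketball"), Atom("players", "player")],
)


def atom_lemmas(tok: Token) -> List[str]:
    """Return the lemmas of a token's atoms, or the token lemma if it has no atoms."""
    if tok.atoms:
        return [atom.lemma for atom in tok.atoms]
    return [tok.lemma]


print(token.lemma)
print(atom_lemmas(token))
```

The token keeps its own lemma for the collocation as a whole, while each atom carries a separate lemma, which mirrors the two levels described above.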
In the case of anaphoras, the lemma is that of the antecedent or postcedent.
For example, in the following text:
Michael Jordan was one of the best basketball players of all time. Scoring was Jordan's stand-out skill, but he still holds a defensive NBA record, with eight steals in a half.
lemmatization recognizes Jordan and he in the second sentence as anaphoras whose antecedent is Michael Jordan in the first sentence, so the lemma returned for both anaphoras is Michael Jordan.
Lemmatization output is part of the JSON object returned by deep linguistic analysis.
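As a rough sketch of how a client might read lemmas out of such a JSON object, the example below parses a hand-written fragment. The key names `tokens`, `text`, and `lemma` and the `lemma_pairs` helper are assumptions made for illustration; the real response layout may differ, so check it against the actual output before relying on this.

```python
import json

# Hypothetical response fragment: key names are assumed, values come from the examples above.
raw = """
{
  "tokens": [
    {"text": "Michael Jordan", "lemma": "Michael Jordan"},
    {"text": "best", "lemma": "good"},
    {"text": "Jordan", "lemma": "Michael Jordan"},
    {"text": "he", "lemma": "Michael Jordan"}
  ]
}
"""


def lemma_pairs(payload: str):
    """Parse the JSON text and return (token text, lemma) pairs in document order."""
    data = json.loads(payload)
    return [(tok["text"], tok["lemma"]) for tok in data.get("tokens", [])]


pairs = lemma_pairs(raw)
for text, lemma in pairs:
    print(f"{text} -> {lemma}")

# Tokens whose lemma differs from the surface text: inflected forms and anaphoras.
changed = [(text, lemma) for text, lemma in pairs if text != lemma]
```

Filtering on tokens whose lemma differs from their text is one simple way to spot both inflected forms (best, good) and resolved anaphoras (he, Michael Jordan) in the output.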