Deep linguistic analysis output
The deep linguistic analysis resource returns a JSON object with this format:
{
    "success": Boolean success flag,
    "data": {
        "content": analyzed text,
        "language": language code,
        "version": technology version info,
        "knowledge": [],
        "tokens": [],
        "phrases": [],
        "sentences": [],
        "paragraphs": []
    }
}
For the description of the content, language and version properties, see the API resources output overview.
The knowledge array contains Knowledge Graph data as a result of the semantic analysis process. Its contents are described in the article about the output of full analysis.
The paragraphs, sentences, phrases and tokens arrays are produced by the text subdivision process.
The items of the tokens array are then enriched by the other deep linguistic analysis processes: part-of-speech tagging, morphological analysis, lemmatization, syntactic analysis and semantic analysis.
The contents of these arrays are described below.
tokens
The tokens array contains an item for every token detected. Each item has a format like this:
{
    "syncon": 62653,
    "start": 74,
    "end": 83,
    "type": "NOU",
    "lemma": "long time",
    "pos": "NOUN",
    "dependency": {
        "id": 11,
        "head": 7,
        "label": "nmod"
    },
    "morphology": "Number=Sing",
    "paragraph": 0,
    "sentence": 0,
    "phrase": 4,
    "atoms": [
        {
            "start": 74,
            "end": 78,
            "type": "ADJ",
            "lemma": "long"
        },
        {
            "start": 79,
            "end": 83,
            "type": "NOU",
            "lemma": "time"
        }
    ]
}
- The syncon property is the outcome of the semantic analysis process. Its value is the ID of the corresponding syncon in the Knowledge Graph. Tokens that do not have a corresponding syncon get the value -1. A positive value matches the value of the syncon property of an entry in the knowledge array.
- type is the result of custom part-of-speech tagging.
- lemma is the result of lemmatization.
- pos is the result of standard part-of-speech tagging.
- dependency is the result of syntactic analysis:
  - id represents the index of the token in the text.
  - label specifies the dependency relation with another token according to the Universal Dependencies conventions.
  - head identifies the token that receives the relation. Its value corresponds to the value of the id property of another token. The only exception is the root token (the one with the label property set to root), for which head and id have the same value.
- morphology is the result of morphological analysis.
- start, end, phrase, sentence and paragraph are the result of the text subdivision process:
  - start and end are the positions of the token text in the analyzed text, which is the value of the content property of the outer data object.
  - phrase is the phrase containing the token; it's the zero-based index of the phrase in the phrases array.
  - sentence is the sentence containing the token; it's the zero-based index of the sentence in the sentences array.
  - paragraph is the paragraph containing the token; it's the zero-based index of the paragraph in the paragraphs array.
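As a rough illustration of how these properties fit together, the following Python sketch recovers the text of each token from start and end and follows the head property of dependency to the governing token. The data dictionary is a hand-written, abbreviated sample rather than real API output, and the helper names are hypothetical. The code assumes that end is exclusive, as the offsets in the samples on this page suggest, and that the relation name is stored in the label property, as in the sample token.

```python
# Hand-written, abbreviated sample shaped like the "data" object (not real API output).
data = {
    "content": "Dogs bark.",
    "tokens": [
        {"start": 0, "end": 4, "lemma": "dog",
         "dependency": {"id": 0, "head": 1, "label": "nsubj"}},
        {"start": 5, "end": 9, "lemma": "bark",
         "dependency": {"id": 1, "head": 1, "label": "root"}},
        {"start": 9, "end": 10, "lemma": ".",
         "dependency": {"id": 2, "head": 1, "label": "punct"}},
    ],
}


def token_text(data, token):
    """Return the token's surface text, assuming `end` is exclusive."""
    return data["content"][token["start"]:token["end"]]


def head_of(data, token):
    """Return the token that governs `token`, or None for the root token."""
    dep = token["dependency"]
    if dep["head"] == dep["id"]:  # the root token points to itself
        return None
    by_id = {t["dependency"]["id"]: t for t in data["tokens"]}
    return by_id[dep["head"]]


def describe_dependencies(data):
    """Build one 'text --label--> head text' line per token."""
    lines = []
    for token in data["tokens"]:
        head = head_of(data, token)
        target = token_text(data, head) if head is not None else "ROOT"
        label = token["dependency"]["label"]
        lines.append(f"{token_text(data, token)} --{label}--> {target}")
    return lines


if __name__ == "__main__":
    print("\n".join(describe_dependencies(data)))
```

The root check relies only on head and id being equal, so it does not depend on how the relation name is spelled.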
In the case of collocations, the token object can contain the atoms array, which has an item for every word of the collocation. In each item of the atoms array:
- start and end are the result of the text subdivision process. They represent the positions of the atom text in the analyzed text, which is the value of the content property of the outer data object.
- type is the result of custom part-of-speech tagging.
- lemma is the result of lemmatization for the word.
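A minimal Python sketch of how a client might split a collocation into its single words is shown here. The sample data is hand-written (it reuses the "long time" collocation with offsets recomputed for a shorter text), the function name words_of is hypothetical, and end is assumed to be exclusive.

```python
# Hand-written sample (not real API output): "long time" as a two-word collocation.
data = {
    "content": "a long time",
    "tokens": [
        {"start": 0, "end": 1, "lemma": "a"},
        {
            "start": 2, "end": 11, "lemma": "long time",
            "atoms": [
                {"start": 2, "end": 6, "type": "ADJ", "lemma": "long"},
                {"start": 7, "end": 11, "type": "NOU", "lemma": "time"},
            ],
        },
    ],
}


def words_of(data, token):
    """Return the single words of a token, assuming `end` is exclusive.

    A token without an atoms array is returned as a one-element list.
    """
    content = data["content"]
    parts = token.get("atoms") or [token]
    return [content[p["start"]:p["end"]] for p in parts]


if __name__ == "__main__":
    for token in data["tokens"]:
        print(token["lemma"], "->", words_of(data, token))
```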
Sometimes the semantic analysis process determines that a token is a named entity (for example, a person's name) even if there is no corresponding concept in the Knowledge Graph.
In this case the syncon property is set to -1, but the token has an additional vsyn property. For example:
{
    "syncon": -1,
    "vsyn": {
        "id": -436106,
        "parent": 73303
    },
    "start": 0,
    "end": 19,
    "type": "NPR.NPH",
    "lemma": "Mauricio Pochettino",
    ...
This property, whose name means "virtual syncon", is an object with two properties:
- id is a negative number generated by the semantic analysis process and assigned to all tokens considered as occurrences of the same entity. It is not the ID of a Knowledge Graph syncon.
- parent is the ID of a Knowledge Graph syncon which, conceptually, is the parent of the concept expressed by the token. For example, if the token has been recognized as a person's name, its parent is the concept of person. The parent syncon data is located in the knowledge array.
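Because all occurrences of the same entity share one virtual syncon ID, tokens can be grouped by concept even when the Knowledge Graph has no entry for them. The Python sketch below shows one possible way to do this. The tokens list is hand-written and reduced to the fields it needs, and the function names are hypothetical.

```python
from collections import defaultdict

# Hand-written sample (not real API output); only the fields used below are present.
tokens = [
    {"syncon": -1, "vsyn": {"id": -436106, "parent": 73303}, "lemma": "Mauricio Pochettino"},
    {"syncon": 62653, "lemma": "long time"},
    {"syncon": -1, "vsyn": {"id": -436106, "parent": 73303}, "lemma": "Pochettino"},
    {"syncon": -1, "lemma": "the"},
]


def concept_key(token):
    """Return the syncon ID, else the virtual syncon ID, else None."""
    if token["syncon"] != -1:
        return token["syncon"]
    vsyn = token.get("vsyn")
    return vsyn["id"] if vsyn else None


def group_by_concept(tokens):
    """Map each concept key to the lemmas of the tokens that share it."""
    groups = defaultdict(list)
    for token in tokens:
        key = concept_key(token)
        if key is not None:  # skip tokens with neither syncon nor vsyn
            groups[key].append(token["lemma"])
    return dict(groups)


if __name__ == "__main__":
    print(group_by_concept(tokens))
```

Real syncon IDs are positive and virtual ones are negative, so the two kinds of key cannot collide in the same dictionary.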
phrases
The phrases array is created and populated by the text subdivision process.
It contains an item for every phrase detected. For example, in the sentence:
Michael Jordan was one of the best basketball players of all time.
the phrase "of all time" corresponds to an array item like this:
{
    "tokens": [
        7,
        8,
        9
    ],
    "type": "PP",
    "start": 54,
    "end": 65
}
The tokens array contains the zero-based indexes of the constituent tokens. For example, token 7 is the 8th token.
type specifies the phrase type.
start and end are the positions of the phrase in the analyzed text, which is the value of the content property of the outer data object.
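The following Python sketch resolves a phrase item to its text and to the text of its tokens. The sentence and the phrase item are taken from the example in this section. The tokens list is an assumed tokenization reduced to start and end, chosen so that indexes 7 to 9 fall on "of all time", and it is not real API output. The helper names are hypothetical and end is assumed to be exclusive.

```python
# The sentence and the phrase item come from the example in this section; the
# tokens list is an assumed tokenization reduced to start/end (not real API output).
data = {
    "content": "Michael Jordan was one of the best basketball players of all time.",
    "tokens": [
        {"start": 0, "end": 14},   # Michael Jordan
        {"start": 15, "end": 18},  # was
        {"start": 19, "end": 22},  # one
        {"start": 23, "end": 25},  # of
        {"start": 26, "end": 29},  # the
        {"start": 30, "end": 34},  # best
        {"start": 35, "end": 53},  # basketball players
        {"start": 54, "end": 56},  # of
        {"start": 57, "end": 60},  # all
        {"start": 61, "end": 65},  # time
        {"start": 65, "end": 66},  # .
    ],
    # Only the phrase under discussion is listed here.
    "phrases": [
        {"tokens": [7, 8, 9], "type": "PP", "start": 54, "end": 65},
    ],
}


def span_text(data, item):
    """Return the text covered by any item with start/end (end exclusive)."""
    return data["content"][item["start"]:item["end"]]


def phrase_token_texts(data, phrase):
    """Return the text of every token listed in the phrase's tokens array."""
    return [span_text(data, data["tokens"][i]) for i in phrase["tokens"]]


if __name__ == "__main__":
    phrase = data["phrases"][0]
    print(phrase["type"], repr(span_text(data, phrase)), phrase_token_texts(data, phrase))
```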
sentences
The sentences array is created and populated by the text subdivision process.
It contains an item for every sentence detected. For example, this sentence:
Michael Jordan was one of the best basketball players of all time.
corresponds to an array item like this:
{
    "phrases": [
        0,
        1,
        2,
        3,
        4,
        5
    ],
    "start": 0,
    "end": 66
}
The phrases array contains the zero-based indexes of the constituent phrases.
start and end are the positions of the sentence in the analyzed text, which is the value of the content property of the outer data object.
paragraphs
The paragraphs array is created and populated by the text subdivision process.
It contains an item for every paragraph detected. For example, this text:
Michael Jordan was one of the best basketball players of all time. Scoring was Jordan's stand-out skill, but he still holds a defensive NBA record, with eight steals in a half.
Michael Jordan was also a baseball player and an actor.
contains two paragraphs and the corresponding array is something like:
"paragraphs": [
    {
        "sentences": [
            0,
            1
        ],
        "start": 0,
        "end": 176
    },
    {
        "sentences": [
            2
        ],
        "start": 177,
        "end": 232
    }
]
The sentences array in each item contains the zero-based indexes of the constituent sentences.
start and end are the positions of the paragraph in the analyzed text, which is the value of the content property of the outer data object.
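The index arrays make it possible to walk the text from paragraphs down to sentences. The Python sketch below does this for the two-paragraph example. The paragraphs items are the ones shown in this section. The sentences items are reduced to start and end, and the offsets of sentences 1 and 2 are derived by hand from the text rather than taken from real API output. The code also assumes that the two paragraphs are separated by a single newline and that end is exclusive. The function name outline is hypothetical.

```python
first = ("Michael Jordan was one of the best basketball players of all time. "
         "Scoring was Jordan's stand-out skill, but he still holds a defensive "
         "NBA record, with eight steals in a half.")
second = "Michael Jordan was also a baseball player and an actor."

data = {
    # Assumption: one newline character separates the two paragraphs.
    "content": first + "\n" + second,
    # Reduced to start/end; offsets of sentences 1 and 2 are derived by hand.
    "sentences": [
        {"start": 0, "end": 66},
        {"start": 67, "end": 176},
        {"start": 177, "end": 232},
    ],
    "paragraphs": [
        {"sentences": [0, 1], "start": 0, "end": 176},
        {"sentences": [2], "start": 177, "end": 232},
    ],
}


def outline(data):
    """Return one list of sentence texts per paragraph (end exclusive)."""
    content = data["content"]
    result = []
    for paragraph in data["paragraphs"]:
        sentences = [data["sentences"][i] for i in paragraph["sentences"]]
        result.append([content[s["start"]:s["end"]] for s in sentences])
    return result


if __name__ == "__main__":
    for number, sentences in enumerate(outline(data)):
        print(f"paragraph {number}:")
        for sentence in sentences:
            print("  ", sentence)
```

The same pattern extends downward: each sentence lists its phrases, and each phrase lists its tokens.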
knowledge
The knowledge array contains Knowledge Graph data for the items of the tokens array. Its contents are described in the article about the output of full analysis.
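As a rough sketch of how tokens and knowledge entries can be joined, the Python code below looks up the knowledge entry for a token through its syncon value, falling back to the parent of its virtual syncon. The sample data is hand-written, not real API output. Only the syncon property of a knowledge entry is taken from this page; the label field and its values are assumptions made for illustration, and the function name knowledge_for is hypothetical.

```python
# Hand-written sample (not real API output). Only the syncon property of a
# knowledge entry is taken from this page; "label" is an assumed field.
data = {
    "knowledge": [
        {"syncon": 73303, "label": "person"},
        {"syncon": 62653, "label": "long time"},
    ],
    "tokens": [
        {"syncon": -1, "vsyn": {"id": -436106, "parent": 73303}},
        {"syncon": 62653},
        {"syncon": -1},
    ],
}


def knowledge_for(data, token):
    """Return the knowledge entry related to a token, or None if there is none.

    Uses the token's own syncon when it has one, otherwise the parent
    of its virtual syncon.
    """
    by_syncon = {entry["syncon"]: entry for entry in data["knowledge"]}
    syncon = token["syncon"]
    if syncon == -1 and "vsyn" in token:
        syncon = token["vsyn"]["parent"]
    return by_syncon.get(syncon)


if __name__ == "__main__":
    for token in data["tokens"]:
        print(token, "->", knowledge_for(data, token))
```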