Deep linguistic analysis output
The deep linguistic analysis resource returns a JSON object with this format:
{
  "success": Boolean success flag,
  "data": {
    "content": analyzed text,
    "language": language code,
    "version": technology version info,
    "knowledge": [],
    "tokens": [],
    "phrases": [],
    "sentences": [],
    "paragraphs": []
  }
}
For the description of the content, language and version properties, see the API resources output overview.
The knowledge array contains Knowledge Graph data as a result of the semantic analysis process. Its contents are described in the article about the output of full analysis.
The paragraphs, sentences, phrases and tokens arrays are produced by the text subdivision process.
The items of the tokens array are then enriched by the other deep linguistic analysis processes: part-of-speech tagging, morphological analysis, lemmatization, syntactic analysis and semantic analysis.
The contents of these arrays are described below.
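Before reading the arrays, a client would typically check the success flag and unwrap the data object. The following is a minimal Python sketch of that step, assuming the response body is already available as a JSON string. The helper name load_analysis and all sample values (including the version string) are made up for illustration.

```python
import json


def load_analysis(raw_body):
    """Parse a response body and return its `data` object.

    Raises ValueError when the `success` flag is missing or false.
    """
    response = json.loads(raw_body)
    if not response.get("success"):
        raise ValueError("the analysis request did not succeed")
    return response["data"]


# Made-up response body, shaped like the format shown above.
sample_body = json.dumps({
    "success": True,
    "data": {
        "content": "Dogs bark.",
        "language": "en",
        "version": "sample-version",
        "knowledge": [],
        "tokens": [],
        "phrases": [],
        "sentences": [],
        "paragraphs": [],
    },
})

data = load_analysis(sample_body)
print(data["language"], len(data["tokens"]))
```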
tokens
The tokens array contains an item for every token detected. Each item has a format like this:
{
  "syncon": 62653,
  "start": 74,
  "end": 83,
  "type": "NOU",
  "lemma": "long time",
  "pos": "NOUN",
  "dependency": {
    "id": 11,
    "head": 7,
    "label": "nmod"
  },
  "morphology": "Number=Sing",
  "paragraph": 0,
  "sentence": 0,
  "phrase": 4,
  "atoms": [
    {
      "start": 74,
      "end": 78,
      "type": "ADJ",
      "lemma": "long"
    },
    {
      "start": 79,
      "end": 83,
      "type": "NOU",
      "lemma": "time"
    }
  ]
}
- The syncon property is the outcome of the semantic analysis process. Its value is the ID of the corresponding syncon in the Knowledge Graph. Tokens that have no corresponding syncon get the value -1. A positive value matches the value of the syncon property of an entry in the knowledge array.
- type is the result of custom part-of-speech tagging.
- lemma is the result of lemmatization.
- pos is the result of standard part-of-speech tagging.
- dependency is the result of syntactic analysis.
  - id represents the index of the token in the text.
  - label specifies the dependency relation with another token according to the Universal Dependencies conventions.
  - head identifies the token that receives the relation. Its value corresponds to the value of the id property of another token, the only exception being the root token (the one with the label property set to root), for which head and id have the same value.
- morphology is the result of morphological analysis.
- start, end, phrase, sentence and paragraph are the result of the text subdivision process.
  - start and end are the positions of the token text in the analyzed text, which is the value of the content property of the outer data object.
  - phrase is the phrase containing the token; it's the zero-based index of the phrase in the phrases array.
  - sentence is the sentence containing the token; it's the zero-based index of the sentence in the sentences array.
  - paragraph is the paragraph containing the token; it's the zero-based index of the paragraph in the paragraphs array.
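The relationship between start/end and the analyzed text, and between head and id, can be sketched in Python as follows. This is an illustration under assumptions, not reference client code: it treats end as exclusive, which is consistent with the sample token above (83 - 74 = 9, the length of "long time"), and the helper names, the sample text and its dependency values are all made up.

```python
def token_text(content, token):
    """Return the token's surface text, treating `end` as exclusive (assumption)."""
    return content[token["start"]:token["end"]]


def head_of(tokens, token):
    """Return the token that receives the relation, or None for the root token.

    The root is recognized by `head` and `id` having the same value.
    """
    dep = token["dependency"]
    if dep["head"] == dep["id"]:
        return None
    for candidate in tokens:
        if candidate["dependency"]["id"] == dep["head"]:
            return candidate
    return None


# Made-up sample: only the properties used by the helpers are included.
content = "Dogs bark."
tokens = [
    {"start": 0, "end": 4, "dependency": {"id": 0, "head": 1, "label": "nsubj"}},
    {"start": 5, "end": 9, "dependency": {"id": 1, "head": 1, "label": "root"}},
    {"start": 9, "end": 10, "dependency": {"id": 2, "head": 1, "label": "punct"}},
]

for token in tokens:
    head = head_of(tokens, token)
    head_text = token_text(content, head) if head else "(root)"
    print(token_text(content, token), "->", head_text)
```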
In the case of collocations, the token object can contain the atoms array, with one item for every word of the collocation. In each item of the atoms array:
- start and end are the result of the text subdivision process. They represent the positions of the atom text in the analyzed text, which is the value of the content property of the outer data object.
- type is the result of custom part-of-speech tagging.
- lemma is the result of lemmatization for the word.
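A consumer that needs single words rather than collocations can fall back on the atoms array when it is present. The Python sketch below shows one way to do that. The helper name word_texts and the sample text are hypothetical, and end is again assumed to be exclusive.

```python
def word_texts(content, token):
    """Return the text of each atom, or of the token itself when it has no atoms."""
    parts = token.get("atoms") or [token]
    return [content[part["start"]:part["end"]] for part in parts]


# Made-up sample text with one collocation ("long time") and one plain token.
content = "It took a long time."
collocation = {
    "start": 10,
    "end": 19,
    "lemma": "long time",
    "atoms": [
        {"start": 10, "end": 14, "type": "ADJ", "lemma": "long"},
        {"start": 15, "end": 19, "type": "NOU", "lemma": "time"},
    ],
}
plain_token = {"start": 3, "end": 7, "lemma": "take"}

print(word_texts(content, collocation))
print(word_texts(content, plain_token))
```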
Sometimes the semantic analysis process determines that a token is a named entity—for example: a person's name—even if there is no corresponding concept in the Knowledge Graph.
In this case the syncon property is set to -1, but the token has an additional vsyn property. For example:
{
  "syncon": -1,
  "vsyn": {
    "id": -436106,
    "parent": 73303
  },
  "start": 0,
  "end": 19,
  "type": "NPR.NPH",
  "lemma": "Mauricio Pochettino",
  ...
This property, whose name means "virtual syncon", is an object with two properties:
- id is a negative number generated by the semantic analysis process and assigned to all tokens considered as occurrences of the same entity. It is not the ID of a Knowledge Graph syncon.
- parent is the number of a Knowledge Graph syncon which, conceptually, is the parent of the concept expressed by the token. For example, if the token has been recognized as a person's name, its parent is the concept of person. The parent syncon data is located in the knowledge array.
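Because every occurrence of the same entity shares one virtual syncon id, that id can serve as a grouping key. The Python sketch below illustrates the idea. The helper names are hypothetical, the second mention ("Pochettino") is made up, and the knowledge entry is reduced to its syncon property, since the other properties are described in the article about the output of full analysis.

```python
from collections import defaultdict


def group_virtual_entities(tokens):
    """Map each virtual syncon id to the lemmas of the tokens that share it."""
    groups = defaultdict(list)
    for token in tokens:
        vsyn = token.get("vsyn")
        if token["syncon"] == -1 and vsyn is not None:
            groups[vsyn["id"]].append(token["lemma"])
    return dict(groups)


def parent_entry(knowledge, token):
    """Return the knowledge entry of the token's parent syncon, or None."""
    parent = token["vsyn"]["parent"]
    return next((entry for entry in knowledge if entry["syncon"] == parent), None)


# Made-up sample: two mentions of one entity plus an ordinary token.
tokens = [
    {"syncon": -1, "vsyn": {"id": -436106, "parent": 73303},
     "lemma": "Mauricio Pochettino"},
    {"syncon": 62653, "lemma": "long time"},
    {"syncon": -1, "vsyn": {"id": -436106, "parent": 73303},
     "lemma": "Pochettino"},
]
knowledge = [{"syncon": 73303}, {"syncon": 62653}]

print(group_virtual_entities(tokens))
print(parent_entry(knowledge, tokens[0]))
```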
phrases
The phrases array is created and populated by the text subdivision process.
It contains an item for every phrase detected. For example, in the sentence:
Michael Jordan was one of the best basketball players of all time.
the phrase "of all time" corresponds to an array item like this:
{
  "tokens": [
    7,
    8,
    9
  ],
  "type": "PP",
  "start": 54,
  "end": 65
}
The tokens array contains the zero-based indexes of the constituent tokens. For example, token 7 is the 8th token.
type specifies the phrase type.
start and end are the positions of the phrase in the analyzed text, which is the value of the content property of the outer data object.
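The phrase item above can be resolved against the analyzed text and the tokens array as in the following Python sketch. The phrase item and the sentence come from the example above. The token offsets are an assumed tokenization (with "Michael Jordan" and "basketball players" as single tokens) reconstructed for illustration, and end is treated as exclusive.

```python
def span_text(content, item):
    """Return the text covered by an item, treating `end` as exclusive (assumption)."""
    return content[item["start"]:item["end"]]


def phrase_tokens(tokens, phrase):
    """Resolve the zero-based token indexes of a phrase to token items."""
    return [tokens[index] for index in phrase["tokens"]]


content = "Michael Jordan was one of the best basketball players of all time."
phrase = {"tokens": [7, 8, 9], "type": "PP", "start": 54, "end": 65}

# Assumed tokenization; only start and end are included for each token.
offsets = [(0, 14), (15, 18), (19, 22), (23, 25), (26, 29), (30, 34),
           (35, 53), (54, 56), (57, 60), (61, 65), (65, 66)]
tokens = [{"start": start, "end": end} for start, end in offsets]

print(span_text(content, phrase))
print([span_text(content, token) for token in phrase_tokens(tokens, phrase)])
```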
sentences
The sentences array is created and populated by the text subdivision process.
It contains an item for every sentence detected. For example, this sentence:
Michael Jordan was one of the best basketball players of all time.
corresponds to an array item like this:
{
  "phrases": [
    0,
    1,
    2,
    3,
    4,
    5
  ],
  "start": 0,
  "end": 66
}
The phrases array contains the zero-based indexes of the constituent phrases.
start and end are the positions of the sentence in the analyzed text, which is the value of the content property of the outer data object.
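The start and end positions also make it possible to go the other way, from a character offset in the analyzed text to the sentence that contains it. A small Python sketch, using the sentence item above and assuming end is exclusive (the sentence is 66 characters long and its end is 66); the helper name is hypothetical.

```python
def sentence_index_at(sentences, offset):
    """Return the zero-based index of the sentence containing `offset`, or None."""
    for index, sentence in enumerate(sentences):
        if sentence["start"] <= offset < sentence["end"]:
            return index
    return None


sentences = [{"phrases": [0, 1, 2, 3, 4, 5], "start": 0, "end": 66}]

print(sentence_index_at(sentences, 54))  # offset inside the sentence
print(sentence_index_at(sentences, 66))  # offset past the end of the sentence
```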
paragraphs
The paragraphs array is created and populated by the text subdivision process.
It contains an item for every paragraph detected. For example, this text:
Michael Jordan was one of the best basketball players of all time. Scoring was Jordan's stand-out skill, but he still holds a defensive NBA record, with eight steals in a half.
Michael Jordan was also a baseball player and an actor.
contains two paragraphs and the corresponding array is something like:
"paragraphs": [
  {
    "sentences": [
      0,
      1
    ],
    "start": 0,
    "end": 176
  },
  {
    "sentences": [
      2
    ],
    "start": 177,
    "end": 232
  }
]
The sentences array in each item contains the zero-based indexes of the constituent sentences.
start and end are the positions of the paragraph in the analyzed text, which is the value of the content property of the outer data object.
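The index-based links between the arrays can be followed from a paragraph down to the text of its sentences, as in this Python sketch. The paragraphs array is the one shown above. The sentences array is an assumption reconstructed from the example text (each item reduced to start and end), a single newline is assumed between the two paragraphs, and end is treated as exclusive.

```python
def paragraph_sentence_texts(data, paragraph_index):
    """Return the text of every sentence in the given paragraph."""
    paragraph = data["paragraphs"][paragraph_index]
    sentences = [data["sentences"][index] for index in paragraph["sentences"]]
    return [data["content"][s["start"]:s["end"]] for s in sentences]


data = {
    "content": (
        "Michael Jordan was one of the best basketball players of all time. "
        "Scoring was Jordan's stand-out skill, but he still holds a defensive "
        "NBA record, with eight steals in a half.\n"
        "Michael Jordan was also a baseball player and an actor."
    ),
    # Assumed sentence offsets, derived from the example text.
    "sentences": [
        {"start": 0, "end": 66},
        {"start": 67, "end": 176},
        {"start": 177, "end": 232},
    ],
    "paragraphs": [
        {"sentences": [0, 1], "start": 0, "end": 176},
        {"sentences": [2], "start": 177, "end": 232},
    ],
}

for index in range(len(data["paragraphs"])):
    print(index, paragraph_sentence_texts(data, index))
```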
knowledge
The knowledge array contains Knowledge Graph data for the items of the tokens array. Its contents are described in the article about the output of full analysis.