Deep linguistic analysis output

The deep linguistic analysis resource returns a JSON object with this format:

{
    "success": Boolean success flag,
    "data": {
        "content": analyzed text,
        "language": language code,
        "version": technology version info,
        "knowledge": [],
        "tokens": [],
        "phrases": [],
        "sentences": [],
        "paragraphs": []
    }
}

For the description of the content, language and version properties, see the API resources output overview.

The paragraphs, sentences, phrases and tokens arrays are produced by the text subdivision process.
The items of the tokens array are then enriched by the other deep linguistic analysis processes: part-of-speech tagging, morphological analysis, lemmatization, syntactic analysis and semantic analysis.
The knowledge array contains Knowledge Graph data as a result of the semantic analysis process.
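
A minimal sketch of reading this structure in Python, assuming the response body is already available as a JSON string. The sample payload is hand-written for illustration (its values, including the version string, are placeholders and not real API output), and load_analysis is a hypothetical helper name.

```python
import json

# Hand-written sample payload; values are placeholders, not real API output.
SAMPLE_RESPONSE = """
{
    "success": true,
    "data": {
        "content": "Michael Jordan was also a baseball player and an actor.",
        "language": "en",
        "version": "placeholder",
        "knowledge": [],
        "tokens": [],
        "phrases": [],
        "sentences": [],
        "paragraphs": []
    }
}
"""

ARRAY_KEYS = ("knowledge", "tokens", "phrases", "sentences", "paragraphs")


def load_analysis(body):
    """Parse the response body and return the inner data object."""
    response = json.loads(body)
    if not response.get("success"):
        raise ValueError("the analysis was not successful")
    data = response["data"]
    # Treating a missing array as empty is a defensive assumption.
    for key in ARRAY_KEYS:
        data.setdefault(key, [])
    return data


data = load_analysis(SAMPLE_RESPONSE)
print(data["language"], len(data["tokens"]))
```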

The contents of these arrays are described below.

tokens

The tokens array contains an item for every token detected. Each item has a format like this:

{
    "syncon": 62653,
    "start": 74,
    "end": 83,
    "type": "NOU",
    "lemma": "long time",
    "pos": "NOUN",
    "dependency": {
        "id": 11,
        "head": 7,
        "label": "nmod"
    },
    "morphology": "Number=Sing",
    "paragraph": 0,
    "sentence": 0,
    "phrase": 4,
    "atoms": [
        {
            "start": 74,
            "end": 78,
            "type": "ADJ",
            "lemma": "long"
        },
        {
            "start": 79,
            "end": 83,
            "type": "NOU",
            "lemma": "time"
        }
    ]
}
  • The syncon property is the outcome of the semantic analysis process. Its value is the ID of the corresponding syncon in the Knowledge Graph. The -1 value is attributed to tokens that do not have a corresponding syncon. A positive value has a match in the value of the syncon property of an entry in the knowledge array.
  • type is the result of custom part-of-speech tagging.
  • lemma is the result of lemmatization.
  • pos is the result of standard part-of-speech tagging.
  • dependency is the result of syntactic analysis.
    • id represents the index of the token in the text.
    • label specifies the dependency relation with another token according to the Universal Dependencies conventions.
    • head identifies the token that receives the relation. Its value corresponds to the value of the id property of another token. The only exception is the root token (the one with the label property set to root), for which head and id have the same value.
  • morphology is the result of morphological analysis.
  • start, end, phrase, sentence and paragraph are the result of the text subdivision process.
    • start and end are the positions of the token text in the analyzed text, which is the value of the content property of the outer data object.
    • phrase is the phrase containing the token; it's the zero-based index of the phrase in the phrases array.
    • sentence is the sentence containing the token; it's the zero-based index of the sentence in the sentences array.
    • paragraph is the paragraph containing the token; it's the zero-based index of the paragraph in the paragraphs array.
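
The properties above can be combined to recover the text of a token and to navigate the dependency relations. The following is a minimal sketch, not official client code: the three tokens are hand-written stubs for a made-up sentence (real output carries more properties), and the helper names are hypothetical. It assumes, as the sample offsets suggest, that end is exclusive.

```python
content = "She sleeps."

# Hand-written stub tokens, reduced to the properties used here.
tokens = [
    {"start": 0, "end": 3, "dependency": {"id": 0, "head": 1, "label": "nsubj"}},
    {"start": 4, "end": 10, "dependency": {"id": 1, "head": 1, "label": "root"}},
    {"start": 10, "end": 11, "dependency": {"id": 2, "head": 1, "label": "punct"}},
]


def token_text(content, token):
    """Slice the analyzed text with the token offsets (end is exclusive)."""
    return content[token["start"]:token["end"]]


def find_root(tokens):
    """Return the token whose head and id coincide, or None."""
    for token in tokens:
        dep = token["dependency"]
        if dep["head"] == dep["id"]:
            return token
    return None


def dependents_of(tokens, head_id):
    """Return the tokens attached to head_id, excluding the root itself."""
    return [
        t for t in tokens
        if t["dependency"]["head"] == head_id and t["dependency"]["id"] != head_id
    ]


root = find_root(tokens)
dependents = dependents_of(tokens, root["dependency"]["id"])
print(token_text(content, root), [token_text(content, t) for t in dependents])
```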

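In the case of collocations, the token object can also contain the atoms array, with one item for every word of the collocation. As the example above shows, each item has its own start, end, type and lemma properties, which describe the single word in the same way the token properties with the same names describe the whole token.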

Sometimes the semantic analysis process determines that a token is a named entity (for example, a person's name) even if there is no corresponding concept in the Knowledge Graph. In this case the syncon property is set to -1, but the token has an additional vsyn property. For example:

{
    "syncon": -1,
    "vsyn": {
        "id": -436106,
        "parent": 73303
    },
    "start": 0,
    "end": 19,
    "type": "NPR.NPH",
    "lemma": "Mauricio Pochettino",
    ...

This property, whose name means "virtual syncon", is an object with two properties:

  • id is a negative number generated by the semantic analysis process and assigned to all tokens considered as occurrences of the same entity. It is not the ID of a Knowledge Graph syncon.
  • parent is the number of a Knowledge Graph syncon which, conceptually, is the parent of the concept expressed by the token. For example, if the token has been recognized as a person's name, its parent is the concept of person. The parent syncon data is located in the knowledge array.
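
Since the same vsyn id is assigned to all the tokens considered occurrences of the same entity, mentions can be grouped by that value. A minimal sketch, assuming hand-written stub tokens: the first one reuses the values of the example above, while the offsets of the second mention and the helper name are made up for illustration.

```python
from collections import defaultdict

# Stub tokens; the second mention and its offsets are made up.
tokens = [
    {"syncon": -1, "vsyn": {"id": -436106, "parent": 73303},
     "start": 0, "end": 19, "lemma": "Mauricio Pochettino"},
    {"syncon": 62653, "start": 74, "end": 83, "lemma": "long time"},
    {"syncon": -1, "vsyn": {"id": -436106, "parent": 73303},
     "start": 120, "end": 130, "lemma": "Mauricio Pochettino"},
]


def group_by_virtual_syncon(tokens):
    """Map each vsyn id to the list of tokens that carry it."""
    groups = defaultdict(list)
    for token in tokens:
        vsyn = token.get("vsyn")
        if token.get("syncon") == -1 and vsyn is not None:
            groups[vsyn["id"]].append(token)
    return dict(groups)


mentions = group_by_virtual_syncon(tokens)
for vsyn_id, group in mentions.items():
    print(vsyn_id, [(t["start"], t["end"]) for t in group])
```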

phrases

The phrases array is created and populated by the text subdivision process. It contains an item for every phrase detected. For example, the phrase of all time in the sentence:

Michael Jordan was one of the best basketball players of all time.

corresponds to an array item like this:

{
    "tokens": [
        7,
        8,
        9
    ],
    "type": "PP",
    "start": 54,
    "end": 65
}

The tokens array contains the zero-based indexes of the constituent tokens. For example, token 7 is the 8th token. type specifies the phrase type. start and end are the positions of the phrase in the analyzed text, which is the value of the content property of the outer data object.
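
As a sketch of how these properties fit together, the snippet below slices the phrase text out of the analyzed text and resolves the token indexes. The phrase item is the one shown above; the tokens list is a hand-built stub that keeps only start and end, with offsets inferred from the sentence rather than copied from real output. Helper names are hypothetical.

```python
content = "Michael Jordan was one of the best basketball players of all time."

phrase = {"tokens": [7, 8, 9], "type": "PP", "start": 54, "end": 65}

# Stub tokens (start and end only); offsets inferred from the sentence.
tokens = [
    {"start": 0, "end": 14},   # Michael Jordan
    {"start": 15, "end": 18},  # was
    {"start": 19, "end": 22},  # one
    {"start": 23, "end": 25},  # of
    {"start": 26, "end": 29},  # the
    {"start": 30, "end": 34},  # best
    {"start": 35, "end": 53},  # basketball players
    {"start": 54, "end": 56},  # of
    {"start": 57, "end": 60},  # all
    {"start": 61, "end": 65},  # time
    {"start": 65, "end": 66},  # .
]


def span_text(content, item):
    """Return the text covered by any item with start and end offsets."""
    return content[item["start"]:item["end"]]


def phrase_tokens(phrase, tokens):
    """Resolve the zero-based indexes into token objects."""
    return [tokens[i] for i in phrase["tokens"]]


words = [span_text(content, t) for t in phrase_tokens(phrase, tokens)]
print(span_text(content, phrase), words)
```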

sentences

The sentences array is created and populated by the text subdivision process. It contains an item for every sentence detected. For example, this sentence:

Michael Jordan was one of the best basketball players of all time.

corresponds to an array item like this:

{
    "phrases": [
        0,
        1,
        2,
        3,
        4,
        5
    ],
    "start": 0,
    "end": 66
}

The phrases array contains the zero-based indexes of the constituent phrases. start and end are the positions of the sentence in the analyzed text, which is the value of the content property of the outer data object.

paragraphs

The paragraphs array is created and populated by the text subdivision process. It contains an item for every paragraph detected. For example, this text:

Michael Jordan was one of the best basketball players of all time. Scoring was Jordan's stand-out skill, but he still holds a defensive NBA record, with eight steals in a half.

Michael Jordan was also a baseball player and an actor.

contains two paragraphs and the corresponding array is something like:

"paragraphs": [
    {
        "sentences": [
            0,
            1
        ],
        "start": 0,
        "end": 176
    },
    {
        "sentences": [
            2
        ],
        "start": 177,
        "end": 232
    }
]

The sentences array in each item contains the zero-based indexes of the constituent sentences. start and end are the positions of the paragraph in the analyzed text, which is the value of the content property of the outer data object.
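
The nesting of paragraphs and sentences can be walked with plain index lookups. The sketch below reuses the paragraphs array shown above and assumes the two paragraphs are separated by a single newline character, which is consistent with the offsets. The sentences list is a stub: only the first item's offsets come from this page, the other two are inferred, and the phrases property is omitted.

```python
content = (
    "Michael Jordan was one of the best basketball players of all time. "
    "Scoring was Jordan's stand-out skill, but he still holds a defensive "
    "NBA record, with eight steals in a half."
    "\n"
    "Michael Jordan was also a baseball player and an actor."
)

paragraphs = [
    {"sentences": [0, 1], "start": 0, "end": 176},
    {"sentences": [2], "start": 177, "end": 232},
]

# Stub sentences; offsets of items 1 and 2 are inferred, phrases omitted.
sentences = [
    {"start": 0, "end": 66},
    {"start": 67, "end": 176},
    {"start": 177, "end": 232},
]


def span_text(content, item):
    """Return the text covered by any item with start and end offsets."""
    return content[item["start"]:item["end"]]


def sentences_by_paragraph(content, paragraphs, sentences):
    """Return, for each paragraph, the list of its sentence texts."""
    return [
        [span_text(content, sentences[i]) for i in paragraph["sentences"]]
        for paragraph in paragraphs
    ]


outline = sentences_by_paragraph(content, paragraphs, sentences)
for number, texts in enumerate(outline):
    print(number, texts)
```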

knowledge

The knowledge array contains Knowledge Graph information about the syncons associated with the tokens.

The link between a token and the corresponding entry in this array is represented by the value of the syncon property both objects have in common, for example:

Token:

{
    "atoms": [
        {
            "end": 45,
            "lemma": "basketball",
            "start": 35,
            "type": "NOU"
        },
        {
            "end": 53,
            "lemma": "player",
            "start": 46,
            "type": "NOU"
        }
    ],
    "dependency": {
        "head": 2,
        "id": 6,
        "label": "nmod"
    },
    "end": 53,
    "lemma": "basketball player",
    "morphology": "Number=Plur",
    "paragraph": 0,
    "phrase": 2,
    "pos": "NOUN",
    "sentence": 0,
    "start": 35,
    "syncon": 41583,
    "type": "NOU"
}

Corresponding entry in the knowledge array:

{
    "label": "person.athlete.basketball_player",
    "properties": [
        {
            "type": "WikiDataId",
            "value": "Q3665646"
        }
    ],
    "syncon": 41583
}

It's a "many-to-one" relationship since multiple tokens can have the same syncon ID, but there's only one entry in the knowledge array for a given syncon, so the knowledge array is a reference table.
For example, if a text contains several occurrences of basketball player, each occurrence corresponds to a separate token, but all tokens "point" to the same entry in the knowledge array.

Tokens with the syncon property set to -1 have no corresponding entry in the knowledge array.
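
Because the knowledge array acts as a reference table, a dictionary keyed by syncon is a natural way to resolve tokens. A minimal sketch using the token and entry shown above (the token is trimmed to a few properties, the second token is a made-up stub, and the helper names are hypothetical):

```python
knowledge = [
    {
        "label": "person.athlete.basketball_player",
        "properties": [{"type": "WikiDataId", "value": "Q3665646"}],
        "syncon": 41583,
    }
]

tokens = [
    {"lemma": "basketball player", "start": 35, "end": 53, "syncon": 41583},
    {"lemma": "of", "start": 54, "end": 56, "syncon": -1},  # made-up stub
]


def index_knowledge(knowledge):
    """Build a lookup table from syncon ID to knowledge entry."""
    return {entry["syncon"]: entry for entry in knowledge}


def knowledge_for(token, knowledge_index):
    """Return the entry for a token, or None when there is no match."""
    return knowledge_index.get(token["syncon"])


knowledge_index = index_knowledge(knowledge)
labels = [
    (token["lemma"], (knowledge_for(token, knowledge_index) or {}).get("label"))
    for token in tokens
]
print(labels)
```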

Each entry in the array has a format like this:

{
    "label": "person",
    "properties": [
        {
            "type": "WikiDataId",
            "value": "Q215627"
        }
    ],
    "syncon": 73282
}

The label property is a textual rendering of the general conceptual category for the syncon in the Knowledge Graph.

The properties array contains the outcome of Knowledge linking. Each item has two properties, type and value. type specifies the knowledge base, value is the property value. Possible knowledge bases and interpretations of the value property follow.

type          value
Coordinate    Latitude and longitude
WikiDataId    Wikidata item ID
DBpediaId     URL of the DBpedia content
GeoNamesId    ID of the record in the GeoNames database
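
A short sketch of reading the properties array, using the entry shown earlier in this section. The helper name is hypothetical, and values are returned as found, because the exact format of each value (for example, of Coordinate) is not specified here.

```python
entry = {
    "label": "person",
    "properties": [{"type": "WikiDataId", "value": "Q215627"}],
    "syncon": 73282,
}


def property_values(entry, prop_type):
    """Return the values of all properties of the given type."""
    return [
        prop["value"]
        for prop in entry.get("properties", [])
        if prop.get("type") == prop_type
    ]


wikidata_ids = property_values(entry, "WikiDataId")
geonames_ids = property_values(entry, "GeoNamesId")
print(wikidata_ids, geonames_ids)
```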