Input for model blocks

First level keys

The top-level keys of the input JSON that a model block recognizes and can use depend on the presence of NL Core inside the model. If the model has this component, as is the case for symbolic models and basic mode ML models, it always recognizes these keys:

  • text (string)
  • sections (array)
  • sectionsText (array)
  • documentLayout (object)
  • options (object)

If the symbolic component is based on NL Core version 4.12 or later, the block also recognizes this key:

  • documentData (array)

    Tip

    You can determine the version of NL Core for a symbolic model by selecting Show resources in the editor or looking at the Resources area after selecting the model in the Models view of the main dashboard.

In general, the block always expects a text to analyze, so one of the keys text, sectionsText or documentLayout is mandatory (see details below), while the other keys are optional.

Advanced mode ML models don't have NL Core and the only input key they recognize is:

  • document (object)

In this case the block doesn't expect a text to analyze: instead it expects text features, that is, the outcome of the NLU analysis of a text.

text

text is the text to be analyzed by NL Core.
When input mapping is needed, this key is typically mapped, through the corresponding text input property, to:

  • A key of the workflow input.
  • The modelName.document.content property of an upstream model block.

text is an alternative to documentLayout: if one of these keys is present in the input, the other must be omitted.
text can be complemented by sections and sectionsText for Studio-generated symbolic models whose rules can distinguish between text sections.
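
For illustration, here is a minimal sketch of an input JSON for a model block with NL Core, written as a Python literal; only the mandatory text key is set, all other keys being optional:

import json

# Minimal input for a model block with NL Core: only the mandatory
# text key is set; sections, sectionsText and options are optional.
payload = {
    "text": "Expert.ai provides natural language understanding capabilities."
}

print(json.dumps(payload, indent=2))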

documentLayout

documentLayout is an object with the same structure as the result key of the Extract Converter processor output, so a model using it is typically preceded by an Extract Converter block, and this key is mapped, through the corresponding documentLayout input property, to that output key.
It must be used for Studio-generated symbolic models with rules that leverage layout information and for extraction ML models trained with layout-based annotations.

Note

Any model with NL Core recognizes this key and can derive the plain text to analyze from it, but there is no point in passing layout information to a model that is not specialized to leverage it.

If documentLayout is present in the input JSON, text, sections and sectionsText—which are alternative means of giving input text to the block—must be omitted.

sections

The sections key is optional and complementary to text. When present, it indicates the boundaries of text sections, for example:

"sections": [
    {
        "name": "TITLE",
        "start": 0,
        "end": 61
    },
    {
        "name": "BODY",
        "start": 62,
        "end": 2407
    }
]

Currently, only symbolic models designed with Studio can contain hand-written symbolic rules that account for sections: with multiple sections, rules can be written so that they are triggered only by the text of a given section. Platform-generated rules, on the other hand, all have the same scope, namely the entire input text, even if the input document has sections.

sections is an array. Each item corresponds to a section and is an object with these properties:

  • name (string): section name.
  • start (integer): zero-based position of the first character in the section inside the value of text.
  • end (integer): zero-based position of the first character after the section inside the value of text.
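
A minimal Python sketch showing how these zero-based, end-exclusive boundaries map onto the value of text (the sample strings are made up for illustration):

title = "This is a title"
body = "This is the body of the document."
text = title + "\n" + body

sections = [
    {"name": "TITLE", "start": 0, "end": len(title)},
    {"name": "BODY", "start": len(title) + 1, "end": len(text)},
]

for s in sections:
    # end is the zero-based position of the first character after the
    # section, so Python's end-exclusive slicing matches the semantics.
    print(s["name"], repr(text[s["start"]:s["end"]]))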

If input mapping is needed, the corresponding sections input property is mapped to a key of the workflow input or to the modelName.document.sections property of an upstream model block which in turn received sections data.

sectionsText

The sectionsText key is text to be analyzed divided into sections, for example:

"sectionsText": [
    {
        "name": "TITLE",
        "text": "This is a title"
    },
    {
        "name": "BODY",
        "text": "This is the body"
    }
]

sectionsText is an array of objects. Each object has these properties:

  • name (string): section name
  • text (string): section text

The model builds the plain text to analyze by concatenating the values of the text properties of the array items, using a newline character as a separator.
If the text key is also set, the text obtained from sectionsText is appended to the value of text, again using a newline character as a separator, so the model receives a text that is the concatenation of the two. The model also receives automatically computed section boundaries relative to the concatenated text.

For example:

  • Value of text:

    We shall pay any price, bear any burden, meet any hardship, support any friend, oppose any foe to assure the survival and success of liberty.
    
  • Value of sectionsText:

    [
        {
            "name": "TITLE",
            "text": "President John F. Kennedy delivered his inaugural address"
        }
    ]
    
  • Concatenated plain text:

    We shall pay any price, bear any burden, meet any hardship, support any friend, oppose any foe to assure the survival and success of liberty.
    President John F. Kennedy delivered his inaugural address
    
  • Section boundaries:

    • Section name: TITLE
    • Start: 142
    • End: 199
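
A minimal Python sketch reproducing the concatenation and the boundary computation of the example above:

text = ("We shall pay any price, bear any burden, meet any hardship, "
        "support any friend, oppose any foe to assure the survival "
        "and success of liberty.")

sections_text = [
    {"name": "TITLE",
     "text": "President John F. Kennedy delivered his inaugural address"}
]

# Each section's text is appended to `text` with a newline separator
# and the resulting section boundaries are computed.
concatenated = text
boundaries = []
for item in sections_text:
    start = len(concatenated) + 1  # +1 accounts for the newline separator
    concatenated += "\n" + item["text"]
    boundaries.append(
        {"name": item["name"], "start": start, "end": start + len(item["text"])}
    )

print(boundaries)  # [{'name': 'TITLE', 'start': 142, 'end': 199}]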

When input mapping is needed, the corresponding sectionsText input property is mapped to a key of the workflow input or to the modelName.document.sectionsText property of an upstream model block which in turn received that data.

options

The options object contains optional parameters that can be passed to the model to influence its behavior. These parameters affect NL Core.

The most extensive structure that this object can have is this:

"allCategories": boolean,
"custom": object,
"disambiguation": {
  "flags": number
},
"output": object,
"rules": object

or, for old models, this:

"allCategories": boolean,
"custom": object

Old models have NL Core version 4.11 or lower.

Tip

You can determine the version of NL Core for a symbolic model by selecting Show resources in the editor or looking at the Resources area after selecting the model in the Models view of the main dashboard. For basic mode ML models, the version of NL Core is tied to that of the ML engine, which is visible when you select the model from the list.

All the components of this structure are optional; they are described below.

allCategories

Retained for backwards compatibility, this option is equivalent to the allCategories property of the rules object.

custom

Retained for backwards compatibility, this option is equivalent to the customOptions property of the rules object.

disambiguation

This is an advanced option for NL Core.
It is meant to be used with the support of your expert.ai technical contact, should they determine that tuning low-level options can improve the quality of the NLU analysis.
When used, this option contains, in its only parameter, flags, a number representing one or more disambiguation options. Multiple options are combined with a bitwise OR.
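
For example, assuming two hypothetical flag values 0x2 and 0x10 (the real values are low-level NL Core constants provided by your technical contact), a Python sketch of the combination:

# Hypothetical flag values: the real ones are NL Core constants
# provided by your expert.ai technical contact.
FLAG_A = 0x2
FLAG_B = 0x10

options = {
    "disambiguation": {
        "flags": FLAG_A | FLAG_B  # bitwise OR: 0x12 = 18
    }
}
print(options)  # {'disambiguation': {'flags': 18}}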

output

The most extensive structure that this object can have is the following:

 "output": {
    "analysis": string array,
    "features": string array,
    "knowledgeProperties": string array
}

All the components of this structure are optional.
These options affect the output of NL Core. The properties specified for this object override the values of corresponding functional properties of the model block. These are the correspondences:

  • analysis array items:

    The presence of items in the analysis array is equivalent to turning on the corresponding functional properties in the model block; their absence is equivalent to turning them off.

    Item value    Functional property
    relevants     Output relevants
    sentiment     Output sentiment
    relations     Output relations
    segments      Output segments
    The Apply rules functional property corresponds to two items, categories and extractions. If all you want is categorization results, specify only categories; if you just want information extraction, specify only extractions.

  • features array items:

    The presence of an item in the features array is equivalent to turning on the corresponding functional property in the model block; its absence is equivalent to turning it off.

    Item value      Functional property
    syncpos         Synchronize positions to original text
    dependency      Output dependency tree
    knowledge       Output knowledge
    externalIds     Output external ids
    extradata       Output rules extra data
    explanations    Output explanations
    namespaces      Output namespace metadata
    documentData    Output document data
    layout          Output layout information
  • knowledgeProperties array: this array replaces the value of the Required user properties for syncons functional property.

If the analysis array is empty, the model still performs document analysis, named entity recognition and keyphrase extraction with NL Core, producing the output keys content, entities, language, mainLemmas, mainPhrases, mainSentences, mainSyncons, options, paragraphs, phrases, sentences, tokens, topics and version.
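
For instance, a minimal sketch of an output object, written as a Python literal, that requests categorization results and the dependency tree (the knowledgeProperties names are made-up examples):

options = {
    "output": {
        # Equivalent to Apply rules limited to categorization results
        "analysis": ["categories"],
        # Equivalent to turning on Output dependency tree
        "features": ["dependency"],
        # Hypothetical user properties required for syncons
        "knowledgeProperties": ["exampleProperty"]
    }
}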

rules

The most extensive structure that this object can have is the following:

"rules": {
  "allCategories": boolean,
  "applyRules": boolean,
  "customOptions": object,
  "namespace": string
}

All the components of this structure are optional; they are described below.

allCategories

When its value is false, a categorization model returns only the categories with the highest scores, that is those with the winner property set to true. The default value is true.

applyRules

The value of this option overrides that of the Apply rules functional property.

customOptions

This object can be used to convey custom options to Studio-generated symbolic models and thesaurus models that access them via specific JavaScript code.

Thesaurus scoring algorithms configuration

Thesaurus models are based on NL Core and contain automatically generated JavaScript code which implements alternative scoring algorithms that can affect the confidence score of the extractions of thesaurus concepts. The JavaScript can also affect extraction scoring in general using thresholds and section boosts.
The scoring algorithms are based on configuration settings which can be changed with the following properties of customOptions:

  • scoreConfig
  • advancedScoreConfig
  • libraryScoreConfig
  • scorePostProcessingConfig

The properties correspond to object variables defined inside the JavaScript. By specifying one or more of the above objects in the input JSON, you can override the default values of the properties of those variables that were set during model generation.

The scoreConfig object contains configuration settings for the thesaurus-based scoring algorithm, which is based on labels and relationships between concepts. It must be set like this (here with default values):

"scoreConfig": {
    "disableScore": false,
    "defaultScore": 1,
    "normalize": 100,
    "boostByHierarchy": {
        "byParent": 1,
        "byChildren": 0.5,
        "byRelated": 0.3
    },
    "boostByFrequency": true,
    "boostByLabel": {
        "matchPrefLabel": 1,
        "matchAltLabel": 0.5,
        "lengthMeasure": 0.1,
        "ignoreCase": true
    }
}

where:

  • disableScore (boolean): if true, the algorithm is not used.
  • defaultScore (number): default base score for all extractions. Ignored if boostByFrequency is true.
  • normalize (number): the final score of each extraction is normalized to a value in the range between 0 and the value of this parameter. Use value 0 to disable score normalization.
  • boostByHierarchy (object): its properties are multiplication factors that are applied to the base score based on the relationship between the extracted concept and other concepts in the thesaurus.

    • byParent (number): applied for every broader concept that is also extracted from the text. Value 0 is interpreted as no multiplication.
    • byChildren (number): applied for every narrower concept that is also extracted from the text. Value 0 is interpreted as no multiplication.
    • byRelated (number): applied for every non-hierarchically related concept that is also extracted from the text. Value 0 is interpreted as no multiplication.
  • boostByFrequency (boolean): when true, the base score is the concept frequency in the text.

  • boostByLabel (object): its properties determine how the base score is affected by the relationship between the extracted text and the concept labels.

    • matchPrefLabel (number): multiplication factor applied to the base score if the matching text is the preferred label.
    • matchAltLabel (number): multiplication factor applied to the base score if the matching text is one of the alternative labels.
    • lengthMeasure (number): multiplication factor applied to the base score, further multiplied by the number of space-separated tokens of the match.
    • ignoreCase (boolean): when true, the case is ignored when matching the text and the labels of the concept.
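
As a rough illustration only (the actual algorithm is implemented in the model's generated JavaScript), here is a Python sketch that applies these settings literally to one hypothetical extraction:

cfg = {
    "defaultScore": 1,
    "boostByHierarchy": {"byParent": 1, "byChildren": 0.5, "byRelated": 0.3},
    "boostByFrequency": True,
    "boostByLabel": {"matchPrefLabel": 1, "matchAltLabel": 0.5,
                     "lengthMeasure": 0.1, "ignoreCase": True},
}

# Hypothetical extraction: the concept occurs 4 times in the text,
# one broader concept was also extracted, and the matching text is
# a 2-token alternative label.
frequency, broader_hits, tokens = 4, 1, 2

# Base score: concept frequency, or the fixed default.
score = frequency if cfg["boostByFrequency"] else cfg["defaultScore"]

# One multiplication per hierarchically related concept also found
# in the text (a factor of 0 means no multiplication at all).
for _ in range(broader_hits):
    score *= cfg["boostByHierarchy"]["byParent"]

score *= cfg["boostByLabel"]["matchAltLabel"]           # alt label match
score *= cfg["boostByLabel"]["lengthMeasure"] * tokens  # length measure

print(score)  # 0.4, before normalization to the 0..normalize range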

The advancedScoreConfig object contains configuration settings for the document-based scoring algorithm, which is based on the frequency and the position of the mentions of the concepts in the text. It must be set like this (here with default values):

"advancedScoreConfig": {
  "disableAdvancedScore": false,
  "parameters": {
    "k": 0.5,
    "b": 0.5,
    "avgE": 1.0,
    "posB": 0.25
  }
}

The formula used by the algorithm combines the following quantities:

  • pos_c is the zero-based start position of the extraction in the text.
  • l_t is the length of the text in characters.
  • posB is a parameter (see below).
  • k is a parameter (see below).
  • e_c is the number of extractions for the concept.
  • e_max is the number of extractions of the most frequently extracted concept.
  • b is a parameter (see below).
  • e_t is the total number of extractions.

The properties of the advancedScoreConfig object are:

  • disableAdvancedScore (boolean): if true, the algorithm is not used.
  • parameters (object): its properties are tuning parameters for the score.

    • k (number): positive tuning parameter to normalize the frequency of the concept in the text. Value between 0 and 1, where 0 means pure relative frequency.
    • b (number): positive tuning parameter that determines the scaling by text length. Value between 0 and 1, where 1 corresponds to fully scaling the concept weight by the text length, and 0 corresponds to no length normalization.
    • avgE (number): average number of extractions per document in a reference corpus. Number greater than 0.
    • posB (number): position bias. It boosts the score of extractions occurring within the initial posB fraction of the text. Number between 0 and 1.
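
Purely as a hedged sketch: the parameter descriptions are compatible with a BM25-style frequency score plus a position bias. The Python reconstruction below merely satisfies those descriptions; it is an assumption, not the documented formula, which is defined inside NL Core:

def advanced_score(pos_c, l_t, e_c, e_max, e_t,
                   k=0.5, b=0.5, avg_e=1.0, pos_b=0.25):
    """ASSUMPTION: an illustrative reconstruction that only aims to
    satisfy the parameter descriptions above (k=0 gives pure relative
    frequency, b scales by text length, posB boosts early mentions);
    the actual formula is defined inside NL Core."""
    length_norm = 1 - b + b * (e_t / avg_e)      # b: text-length scaling
    freq = e_c / ((1 - k) * e_max + k * e_c * length_norm)
    early = 1.5 if pos_c < pos_b * l_t else 1.0  # hypothetical early boost
    return freq * early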

The libraryScoreConfig object contains configuration settings for an algorithm based on TF-IDF.
It must be set like this (here with default values):

"libraryScoreConfig" : {
  "disableLibraryScore" : false,
  "idf" : {...}
};

where:

  • disableLibraryScore (boolean): if true, the algorithm is not used.
  • idf (object): pre-computed inverse document frequencies for terms. Each property of the object is a term and the value of the property is the inverse document frequency. For example:
{"investment":5.2,"interest":8.7,"stock":3.4,"dividend":6.1,"portfolio":9.3,"asset allocation":2.9,"equity":7.5,"capital gains":4.6,"bond":8.2,"liquidity":2.3,"mutual fund":6.7,"market value":5.8,"fixed income":3.1,"risk management":9.6,"hedge fund":7.3,"credit rating":4.9,"financial planner":6.5,"pension":3.8,"retirement account":8.9,"401(k)":5.6,"debt":2.7,"budget":9.2,"savings account":4.3,"tax deduction":7.9,"inflation":3.6,"insurance":9.1,"credit score":2.4,"real estate":8.4,"net worth":6.3,"cash flow":5.5,"economic indicators":4.7,"asset management":7.7,"leverage":3.9,"dollar cost averaging":8.6,"compound interest":6.9,"credit card":2.8,"recession":9.7,"solvency":5.9,"taxable income":4.2,"bankruptcy":7.2 ...}

The scorePostProcessingConfig object contains general score thresholds and section-based boost values. It is defined like this (with default values):

"scorePostProcessingConfig": {
  "cutThresholds": {
    "cutByScore": 0.0,
    "cutByChildrenScore": 0.0
  },
  "sectionsBoost": {
    "TITLE": 3,
    "BODY": 1
  }
}

where:

  • cutThresholds (object): its properties may determine the removal of concept extractions if their score does not reach a specified threshold.

    • cutByScore (number): extractions with a confidence score lower than this are cut from the output of the model. Value 0 means no cut.
    • cutByChildrenScore (number): extractions are cut from the output if descendant concepts are also extracted and their score is higher than this threshold. Value 0 means no cut.
  • sectionsBoost (object): if the model has configured sections, this parameter allows boosting the confidence score of extractions based on the section in which the mentions of concepts were found.
    Each of its properties has the name of a section and a numeric value which is used as a multiplication factor applied to the confidence score of any extraction originating from text in that section. Value 1 means no boost.
    The default value above is for a model with TITLE and BODY sections.
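
An illustrative Python sketch of this post-processing (the ordering of boost and cut is an assumption; the actual logic is in the model's JavaScript):

cfg = {
    "cutThresholds": {"cutByScore": 0.2, "cutByChildrenScore": 0.0},
    "sectionsBoost": {"TITLE": 3, "BODY": 1},
}

extractions = [
    {"concept": "finance", "score": 0.5, "section": "TITLE"},
    {"concept": "sports", "score": 0.1, "section": "BODY"},
]

kept = []
for e in extractions:
    # Multiply the confidence score by the factor configured for the
    # section where the concept was found (1 means no boost).
    score = e["score"] * cfg["sectionsBoost"].get(e["section"], 1)
    # Drop extractions below the cutByScore threshold (0 means no cut).
    threshold = cfg["cutThresholds"]["cutByScore"]
    if threshold and score < threshold:
        continue
    kept.append({**e, "score": score})

print(kept)  # only "finance" survives, with its score boosted to 1.5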

normalizeToConceptId

The normalizeToConceptId property of the customOptions object is a boolean that, when true, makes a thesaurus model add to its output extra data containing additional thesaurus information for extracted concepts.

namespace

The value of this option overrides that of the Rules output namespace functional property.

documentData

The documentData input key is optional and, when present, contains additional information about the document that can be used by a symbolic model.

It is an array, each item of which represents one piece of information. The type of information is indicated by the mandatory type property:

  • disambiguation: a text token or a disambiguation which, for the text ranges indicated by positions (see below), overwrites the choices made by the model's text analysis.
  • entity: reserved for future use.
  • tag: a tag instance which, in the positions indicated by positions (see below), is added to any other tag instances that a CPK developed with Studio can produce via tagging rules or JavaScript, and that the same CPK can exploit in categorization or extraction rules. More information about tag levels is available in the dedicated documentation. The tag instance sits on level 0.
  • annotation: reserved for future use.

If type is disambiguation, the item also has a disambiguationOptions object property.
The disambiguationOptions object has a mandatory property type which can be either token or semantic. If it's token, it's also the only property of the object and means that positions contains the ranges of one or more tokens that are alternative to those the text analysis would find when tokenizing the text.
If type is semantic, instead, the remaining properties of the disambiguationOptions object specify an alternative disambiguation for the text ranges indicated in positions. These properties are:

  • baseForm (string): base form, that is the lemma
  • entityId (integer): a numeric ID of choice, used to identify all the documentData disambiguation items that refer to the same named entity
  • extraData: reserved for future use
  • parentSyncon (integer): identification number of the "parent" syncon in the Knowledge Graph

If type is tag, the item also has a tagOptions object property which in turn has these properties:

  • tag (string): name of the tag
  • value (string): optional value of the tag instance; if omitted, the values of the tag are the portions of text indicated in positions

Each item of the positions array is a character range.
For information of type disambiguation, if the sub-type is token, each range corresponds to a different token; if instead it is semantic, the ranges are occurrences of the concept.
For information of type tag, each range corresponds to an occurrence of the tag, possibly with the same value, if specified.
Each item of the array is an object with two properties, start and end, which must be valued with the same logic as the positions of output elements.
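
For example, a minimal sketch of a documentData array, written as a Python literal, that injects a level-0 tag instance and a semantic disambiguation (positions, IDs and names are made-up values):

document_data = [
    {
        # Tag instance on level 0 spanning characters 10-16
        "type": "tag",
        "tagOptions": {"tag": "PRODUCT"},         # no value: the tagged
        "positions": [{"start": 10, "end": 16}],  # text is the value
    },
    {
        # Overrides the model's own disambiguation for this range
        "type": "disambiguation",
        "disambiguationOptions": {
            "type": "semantic",
            "baseForm": "apple",
            "entityId": 1,           # ID of choice for this entity
            "parentSyncon": 12345,   # hypothetical Knowledge Graph ID
        },
        "positions": [{"start": 42, "end": 47}],
    },
]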

document

Blocks corresponding to ML models placed in the workflow in advanced mode expect an input JSON with the top-level key document. This key is an object with the same structure as the output of a symbolic model.
The reason for this is that the block doesn't have NL Core, it only contains the prediction model. It doesn't expect a text to analyze: it expects the features of the text extracted by an NLU analysis that it can't perform itself. Features are the basis of the model's predictions.
Any upstream block with NL Core can be used to perform feature extraction; for example, you can use the NLP Core knowledge model. The document input property must then be mapped to the key with the same name in the output of the feature extraction block.
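
A minimal sketch of this chaining, assuming the upstream feature extraction block produced the usual output with a document key:

# Output of an upstream block with NL Core (for example the NLP Core
# knowledge model); `document` holds the NLU analysis of the text.
upstream_output = {
    "document": {
        "content": "Text that was analyzed upstream...",
        "tokens": [],  # ...plus the other NLU analysis keys
    }
}

# Input for the advanced mode ML model block: the document input
# property is mapped to the upstream key with the same name.
ml_block_input = {"document": upstream_output["document"]}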