Input for model blocks
First level keys
The top-level keys of the input JSON that a model block recognizes and can use depend on the presence of NL Core inside the model. If the model has this component, as is the case for symbolic models and basic mode ML models, it always recognizes these keys:
- text (string)
- sections (array)
- sectionsText (object)
- documentLayout (object)
- options (object)
If the symbolic component is based on NL Core version 4.12 or later, the block also recognizes this key:
- documentData (array)
Tip
You can determine the version of NL Core for a symbolic model by selecting Show resources in the editor or looking at the Resources area after selecting the model in the Models view of the main dashboard.
In general, the block always expects a text to analyze, so one of the keys text, sectionsText and documentLayout is mandatory (see details below), while the other keys are optional.
Advanced mode ML models don't have NL Core and the only input key they recognize is:
- document (object)
In this case the block doesn't expect a text to analyze. Instead, it expects text features, that is, the outcome of the NLU analysis of a text.
text
text is the text that must be analyzed by NL Core.
When input mapping is needed, this key is typically mapped, through the corresponding text input property, to:
- The modelName.document.content key of another model block.
- The content key of a TikaTesseract Converter processor or a URL Converter processor block.
text is an alternative to documentLayout: if one of these keys is present in the input, the other must be omitted.
text can be complemented by sections and sectionsText for Studio-generated symbolic models whose rules can distinguish between text sections.
documentLayout
documentLayout is an object with the same structure as the result key of the Extract Converter processor output, so a model using it is typically preceded by an Extract Converter block, and this key is mapped, through the corresponding documentLayout input property, to that output key.
It must be used for Studio-generated symbolic models with rules that leverage layout information and for extraction ML models trained with layout-based annotations.
Note
Any model with NL Core recognizes this key and is able to derive plain text to analyze from it, but there is no point in passing layout information to a model that is not specialized to leverage it.
If documentLayout is present in the input JSON, the text, sections and sectionsText keys, which are alternative means of giving input text to the block, must be omitted.
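The rules stated so far for models with NL Core (at least one text source is required, and documentLayout excludes text, sections and sectionsText) can be checked before calling a block. The following is a minimal sketch; the validate_input function is hypothetical and is not part of the product.

```python
# Keys that can carry the text to analyze for a model with NL Core.
TEXT_SOURCES = ("text", "sectionsText", "documentLayout")


def validate_input(payload):
    """Return a list of problems found in a model block input (empty if none).

    Hypothetical helper: it only encodes the rules described in this page.
    """
    problems = []
    # One key among text, sectionsText and documentLayout is mandatory.
    if not any(key in payload for key in TEXT_SOURCES):
        problems.append("one of text, sectionsText or documentLayout is required")
    # documentLayout is alternative to every other way of passing text.
    if "documentLayout" in payload:
        for key in ("text", "sections", "sectionsText"):
            if key in payload:
                problems.append(key + " must be omitted when documentLayout is present")
    return problems


print(validate_input({"text": "Some text", "options": {}}))
print(validate_input({"documentLayout": {}, "text": "Some text"}))
```

The first call prints an empty list; the second reports that text must be omitted.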
sections
The sections key is optional and complementary to text. When present, it indicates the boundaries of text sections, for example:
"sections": [
  {
    "name": "TITLE",
    "start": 0,
    "end": 61
  },
  {
    "name": "BODY",
    "start": 62,
    "end": 2407
  }
]
Currently, only symbolic models designed with Studio can contain hand-written symbolic rules that account for sections. In particular, with multiple sections, rules can be written that are triggered only by the text of a given section. Platform-generated rules, by contrast, all have the same scope, the entire input text, even if the input document has sections.
sections is an array. Each item corresponds to a section and it's an object with these properties:
- name: section name.
- start: zero-based position of the first character in the section inside the value of text.
- end: zero-based position of the first character after the section inside the value of text.
If input mapping is needed, the corresponding sections input property is typically mapped to a key of the workflow input or to the modelName.document.sections property of an upstream model block which in turn received sections data.
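Because start is inclusive and end points to the first character after the section, a section can be recovered from text with an ordinary slice. The sketch below illustrates this; the section_text and check_sections helpers and the sample text are hypothetical.

```python
def section_text(text, section):
    """Return the portion of text covered by a section item.

    end is the position of the first character after the section,
    so a plain slice returns exactly the section.
    """
    return text[section["start"]:section["end"]]


def check_sections(text, sections):
    """Return a list of sections whose boundaries fall outside the text."""
    problems = []
    for section in sections:
        start, end = section["start"], section["end"]
        if not (0 <= start <= end <= len(text)):
            problems.append(section["name"])
    return problems


text = "Breaking news\nSomething happened today."
sections = [
    {"name": "TITLE", "start": 0, "end": 13},
    {"name": "BODY", "start": 14, "end": len(text)},
]

for section in sections:
    print(section["name"], "->", repr(section_text(text, section)))
```

Position 13 is the newline between the two sections, which is why BODY starts at 14.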
sectionsText
The sectionsText key is text to be analyzed divided into sections, for example:
"sectionsText": [
  {
    "name": "TITLE",
    "text": "This is a title"
  },
  {
    "name": "BODY",
    "text": "This is the body"
  }
]
sectionsText is an array of objects. Each object has these properties:
- name: section name
- text: section text
The model builds plain text to analyze by concatenating the values of the text properties of the array items using a newline character as a separator.
If the text key is also set, the text obtained from sectionsText is appended to the one represented by text using a newline character as a separator, so the model receives a text that is the result of the concatenation of two texts. The model also receives automatically computed section boundaries relative to the concatenated text.
For example:
- Value of text:
  We shall pay any price, bear any burden, meet any hardship, support any friend, oppose any foe to assure the survival and success of liberty.
- Value of sectionsText:
  [ { "name": "TITLE", "text": "President John F. Kennedy delivered his inaugural address" } ]
- Concatenated plain text:
  We shall pay any price, bear any burden, meet any hardship, support any friend, oppose any foe to assure the survival and success of liberty.
  President John F. Kennedy delivered his inaugural address
- Section boundaries:
  - Section name: TITLE
  - Start: 142
  - End: 199
When input mapping is needed, the corresponding sectionsText input property is mapped to a key of the workflow input or to the modelName.document.sectionsText property of a model block which in turn received that data.
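The concatenation and the boundary computation described in this section can be reproduced with a few lines of code. This is only an illustrative sketch of the documented behavior, not the actual NL Core implementation, and the concatenate function is hypothetical.

```python
def concatenate(text, sections_text):
    """Return (plain_text, sections) for an optional text plus sectionsText items.

    Pieces are joined with a newline; each section gets a start (inclusive)
    and an end (first character after the section) in the joined text.
    """
    pieces = []
    sections = []
    offset = 0
    if text is not None:
        pieces.append(text)
        offset = len(text)
    for item in sections_text:
        if pieces:
            offset += 1  # newline separator before this piece
        start = offset
        offset += len(item["text"])
        sections.append({"name": item["name"], "start": start, "end": offset})
        pieces.append(item["text"])
    return "\n".join(pieces), sections


text = ("We shall pay any price, bear any burden, meet any hardship, "
        "support any friend, oppose any foe to assure the survival "
        "and success of liberty.")
sections_text = [
    {"name": "TITLE",
     "text": "President John F. Kennedy delivered his inaugural address"}
]

plain, sections = concatenate(text, sections_text)
print(sections)  # TITLE starts at 142 and ends at 199, as in the example above
```

The input text is 141 characters long, the newline separator occupies position 141, and the 57-character title therefore spans positions 142 to 199.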
options
The options object contains optional parameters that can be passed to the model to influence its behavior. They mainly affect NL Core.
The most extensive structure that this object can have is this:
"allCategories": boolean,
"custom": object,
"disambiguation": {
  "flags": number
},
"output": object,
"rules": object
or, for old models, this:
"allCategories": boolean,
"custom": object
Old models have NL Core version 4.11 or lower.
Tip
You can determine the version of NL Core for a symbolic model by selecting Show resources in the editor or looking at the Resources area after selecting the model in the Models view of the main dashboard. For basic mode ML models, the version of NL Core is tied to that of the ML engine, which is visible when you select the model from the list.
All the components of this structure are optional; they are described below.
allCategories
Retained for backwards compatibility, this option is equivalent to the allCategories property of the rules object.
custom
Retained for backwards compatibility, this option is equivalent to the customOptions property of the rules object.
disambiguation
This is an advanced option for NL Core.
It is meant to be used with the support of your expert.ai technical contact, should they determine that tuning low-level options can improve the quality of the NLU analysis.
When used, this option contains, in its only parameter flags, a number representing one or more disambiguation options. Multiple options are combined with a bitwise OR.
output
The most extensive structure that this object can have is the following:
"output": {
  "analysis": string array,
  "features": string array,
  "knowledgeProperties": string array
}
All the components of this structure are optional.
These options affect the output of NL Core. The properties specified for this object override the values of corresponding functional properties of the model block. These are the correspondences:
- analysis array items: the presence of an item in the analysis array is equivalent to turning on the corresponding functional property.

  | Item value | Functional property |
  | --- | --- |
  | relevants | Output relevants |
  | sentiment | Output sentiment |
  | relations | Output relations |
  | segments | Output segments |

- features array items: the presence of an item in the features array is equivalent to turning on the corresponding functional property.

  | Item value | Functional property |
  | --- | --- |
  | syncpos | Synchronize positions to original text |
  | dependency | Output dependency tree |
  | knowledge | Output knowledge |
  | externalIds | Output external ids |
  | extradata | Output rules extra data |
  | explanations | Output explanations |
  | namespaces | Output namespace metadata |
  | documentData | Output document data |
  | layout | Output layout information |

- knowledgeProperties array: this array replaces the value of the Required user properties for syncons functional property.
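As an illustration of how these correspondences translate into an input JSON, the sketch below builds an options object with an output property and rejects item values that are not listed above. The build_output_options helper is hypothetical, and the accepted values are simply those documented in this section.

```python
import json

# Item values taken from the correspondences listed in this section.
ANALYSIS_ITEMS = {"relevants", "sentiment", "relations", "segments"}
FEATURE_ITEMS = {
    "syncpos", "dependency", "knowledge", "externalIds", "extradata",
    "explanations", "namespaces", "documentData", "layout",
}


def build_output_options(analysis=(), features=(), knowledge_properties=None):
    """Return {"options": {"output": {...}}} with only the requested parts."""
    unknown = (set(analysis) - ANALYSIS_ITEMS) | (set(features) - FEATURE_ITEMS)
    if unknown:
        raise ValueError("unknown item values: " + ", ".join(sorted(unknown)))
    output = {}
    if analysis:
        output["analysis"] = list(analysis)
    if features:
        output["features"] = list(features)
    if knowledge_properties is not None:
        output["knowledgeProperties"] = list(knowledge_properties)
    return {"options": {"output": output}}


payload = {"text": "Some text to analyze"}
payload.update(build_output_options(analysis=["sentiment"],
                                    features=["syncpos", "knowledge"]))
print(json.dumps(payload, indent=2))
```

Since all the components of output are optional, the helper omits any array that was not requested instead of sending it empty.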
rules
The most extensive structure that this object can have is the following:
"rules": {
  "allCategories": boolean,
  "applyRules": boolean,
  "customOptions": object,
  "namespace": string
}
All the components of this structure are optional; they are described below.
allCategories
When its value is false, a categorization model returns only the categories with the highest scores, that is those with the winner property set to true. The default value is true.
applyRules
The value of this option overrides that of the Apply rules functional property.
customOptions
This object can be used to convey custom options to Studio-generated symbolic models and thesaurus models that access them via specific JavaScript code.
scoreConfig
Thesaurus models are based on NL Core and contain automatically generated JavaScript code that implements a scoring algorithm that can affect the confidence score of the extracted concept.
The scoring algorithm is based on a configuration which can be changed with the scoreConfig property of customOptions.
The default configuration for the scoring algorithm corresponds to this scoreConfig object:
"scoreConfig": {
  "disableScore": false,
  "defaultScore": 1,
  "normalize": 100,
  "boostByHierarchy": {
    "byParent": 1,
    "byChildren": 0.5,
    "byRelated": 0.3
  },
  "boostByFrequency": true,
  "boostByLabel": {
    "matchPrefLabel": 1,
    "matchAltLabel": 0.5,
    "lengthMeasure": 0.1,
    "ignoreCase": true
  }
}
If you pass one or more of the configuration settings above to the model with non-default values, you affect the scoring algorithm.
These are the properties of the scoreConfig object:
- disableScore (boolean, default value false): if true, a Studio-like scoring algorithm is used. All the other options are ignored, so you can omit them.
- defaultScore (number, default value 1): default base score for all extractions. Ignored if boostByFrequency is true.
- normalize (number, default value 100): the final score of an extraction is normalized to a value in the range between 0 and the value of this parameter. Use value 0 to disable score normalization.
- boostByHierarchy: this property is an object whose properties are multiplication factors that are applied to the base score based on the relationship between the extracted concept and other concepts in the thesaurus.
  - byParent (number, default value 1): applied for every broader concept that is also extracted from the text. Value 0 is interpreted as no multiplication.
  - byChildren (number, default value 0.5): applied for every narrower concept that is also extracted from the text. Value 0 is interpreted as no multiplication.
  - byRelated (number, default value 0.3): applied for every non-hierarchically related concept that is also extracted from the text. Value 0 is interpreted as no multiplication.
- boostByFrequency (boolean, default value true): when true, the base score is the concept frequency in the text.
- boostByLabel: this property is an object whose properties determine how the base score is affected by the relationship between the extracted text and the concept labels.
  - matchPrefLabel (number, default value 1): multiplication factor applied to the base score if the matching text is the preferred label.
  - matchAltLabel (number, default value 0.5): multiplication factor applied to the base score if the matching text is one of the alternative labels.
  - lengthMeasure (number, default value 0.1): multiplication factor applied to the base score that is further multiplied by the number of tokens (separated by spaces) of the match.
  - ignoreCase (boolean, default value true): when true, the case is ignored when matching the text and the labels of the concept.
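Since only non-default settings change the behavior of the scoring algorithm, a client can compare the desired configuration with the defaults and send just the differences. The sketch below does that and places the result under options.rules.customOptions, following the structure of the rules object. The helper names are hypothetical, and whether the model accepts a partially specified nested object such as boostByHierarchy is an assumption to verify.

```python
import json

# Default configuration as documented in this section.
DEFAULT_SCORE_CONFIG = {
    "disableScore": False,
    "defaultScore": 1,
    "normalize": 100,
    "boostByHierarchy": {"byParent": 1, "byChildren": 0.5, "byRelated": 0.3},
    "boostByFrequency": True,
    "boostByLabel": {
        "matchPrefLabel": 1,
        "matchAltLabel": 0.5,
        "lengthMeasure": 0.1,
        "ignoreCase": True,
    },
}


def score_config_overrides(desired):
    """Return only the settings in desired that differ from the defaults.

    Unknown setting names raise KeyError.
    """
    overrides = {}
    for key, value in desired.items():
        default = DEFAULT_SCORE_CONFIG[key]
        if isinstance(default, dict):
            # Assumption: nested objects can be passed partially.
            nested = {k: v for k, v in value.items() if default[k] != v}
            if nested:
                overrides[key] = nested
        elif value != default:
            overrides[key] = value
    return overrides


def build_options(desired):
    """Wrap the overrides in the options structure described in this page."""
    score_config = score_config_overrides(desired)
    return {"options": {"rules": {"customOptions": {"scoreConfig": score_config}}}}


desired = {
    "normalize": 0,                 # disable score normalization
    "boostByFrequency": True,       # same as the default, so it is dropped
    "boostByHierarchy": {"byParent": 1, "byChildren": 0.8},
}
print(json.dumps(build_options(desired), indent=2))
```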
normalizeToConceptId
The normalizeToConceptId property of the customOptions object is a boolean that, when true, makes a thesaurus model add to its output extra data containing additional thesaurus information for extracted concepts.
namespace
The value of this option overrides that of the Rules output namespace functional property.
documentData
The documentData input key is optional and, when present, contains side-by-side information about the document that can be used by a symbolic model.
It is an array, each item of which represents one piece of information.
The type of information is indicated by the mandatory type property:
- disambiguation: a text token or a disambiguation which, for the text ranges indicated by positions (see below), overwrites the choices made by the model's text analysis.
- entity: reserved for future use.
- tag: a tag which, in the positions indicated by positions (see below), is added to any other tags that a CPK developed with Studio can produce via tagging rules or JavaScript and that the same CPK can exploit in categorization or extraction rules.
- annotation: reserved for future use.
If type is disambiguation, the item also has a disambiguationOptions object property.
The disambiguationOptions object has a mandatory property type which can be either token or semantic. If it's token, it's also the only property of the object and means that positions contains the ranges of one or more tokens that are alternative to those that the text analysis would find when tokenizing the text.
If type is semantic, instead, the remaining properties of the disambiguationOptions object specify an alternative disambiguation for the text ranges indicated in positions. These properties are:
- baseForm: base form, that is the lemma
- entityId: a numeric ID of choice, used to identify any documentData disambiguation item referred to the same named entity
- extraData: reserved for future use
- parentSyncon: identification number of the "parent" syncon in the Knowledge Graph
If type is tag, the item also has a tagOptions object property which in turn has these properties:
- tag: name of the tag
- value: optional value of the tag; if omitted, the values of the tag are the portions of text indicated in positions
Each item of the positions array is a character range.
In the case of information of type disambiguation, if the sub-type is token, each range corresponds to a different token; if instead it is semantic, the ranges are occurrences of the concept.
In the case of tag type information, each range corresponds to an occurrence of the tag, possibly with the same value, if specified.
Each item of the array is an object with two properties, start and end, which must be valued with the same logic as the positions of output elements.
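To make the shape of a documentData item more concrete, the sketch below builds an item of type tag for every occurrence of a phrase in the text. It rests on two assumptions that should be checked against the output reference: that positions is a property of the item alongside type and tagOptions, and that end is the position of the first character after the range, as for sections. The find_positions and tag_item helpers, the sample text and the COMPANY tag name are hypothetical.

```python
import json


def find_positions(text, phrase):
    """Return a list of {"start", "end"} ranges, one per occurrence of phrase.

    Assumption: end is the position of the first character after the range.
    """
    if not phrase:
        raise ValueError("phrase must not be empty")
    positions = []
    start = text.find(phrase)
    while start != -1:
        end = start + len(phrase)
        positions.append({"start": start, "end": end})
        start = text.find(phrase, end)
    return positions


def tag_item(text, phrase, tag, value=None):
    """Build a documentData item of type tag for all occurrences of phrase."""
    tag_options = {"tag": tag}
    if value is not None:
        # When value is omitted, the tag values are the tagged text portions.
        tag_options["value"] = value
    return {
        "type": "tag",
        "tagOptions": tag_options,
        "positions": find_positions(text, phrase),  # assumed location of positions
    }


text = "Acme hired Bob. Acme grew."
payload = {"text": text, "documentData": [tag_item(text, "Acme", "COMPANY")]}
print(json.dumps(payload, indent=2))
```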
document
Blocks corresponding to ML models placed in the workflow in advanced mode expect an input JSON with the top-level key document. This key is an object with the same structure as the output of a symbolic model.
The reason for this is that the block doesn't have NL Core; it only contains the prediction model. It doesn't expect a text to analyze; it expects the features of the text extracted by an NLU analysis that it can't perform itself. Features are the basis of the model's predictions.
Any upstream block with NL Core can be used to perform feature extraction; for example, you can use the NLP Core knowledge model. The document input property must then be mapped to the key with the same name in the output of the feature extraction block.