Input for model blocks
First level keys
The top level keys of the input JSON that a model block is able to interpret are:
text (string)
sections (array)
sectionsText (array)
documentLayout (object)
options (object)
Some keys are mutually exclusive because they represent alternative ways of supplying the text to be parsed to the model block.
The sections key can only be present in combination with text, while sectionsText and text can be used alone or in combination.
text
text is the text that is analyzed by the model.
This input property is usually mapped to:
- The modelName.document.content key of a model block's output.
- The content key of TikaTesseract Converter processor output.
- The content key of URL Converter processor output.
text is an alternative to documentLayout: if one of these keys is present, the other must be omitted.
text can be complemented by sections and sectionsText for symbolic models that use text sections.
documentLayout
documentLayout is an object with the same structure as the result key of Extract Converter output, so a model using it is often preceded by an Extract Converter block, with this input property mapped to that output key.
It must be used for symbolic models that need layout information and for extraction ML models trained with layout-based annotations. From this object the model obtains both the text to analyze and the graphical layout information of the original document.
If documentLayout is present in the input JSON, then text, sections and sectionsText, which are alternative means of giving input text to the block, must be omitted.
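The combination rules for the top-level keys can be checked before a request is sent. The following is a minimal sketch, not part of the product: the function name validate_model_input is hypothetical, and the last check (at least one text source must be present) is an assumption that the page does not state explicitly.

```python
def validate_model_input(payload: dict) -> list[str]:
    """Return the problems found in the top-level keys of a model block input."""
    keys = set(payload)
    problems = []

    # documentLayout replaces every other way of supplying text.
    clashing = keys & {"text", "sections", "sectionsText"}
    if "documentLayout" in keys and clashing:
        problems.append(
            "documentLayout cannot be combined with: " + ", ".join(sorted(clashing))
        )

    # sections only carries boundaries, so it needs text to refer to.
    if "sections" in keys and "text" not in keys:
        problems.append("sections requires text")

    # Assumption: a request without any text source is not useful.
    if not keys & {"text", "sectionsText", "documentLayout"}:
        problems.append("no text source: supply text, sectionsText or documentLayout")

    return problems


print(validate_model_input({"documentLayout": {}, "text": "abc"}))
```

An empty list means that the key combination is consistent with the rules described on this page.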
sections
The sections key is complementary to text and indicates the boundaries of text sections for symbolic models that can leverage this information, for example:
"sections": [
  {
    "name": "TITLE",
    "start": 0,
    "end": 61
  },
  {
    "name": "BODY",
    "start": 62,
    "end": 2407
  }
]
Currently only symbolic models built with Studio can account for sections.
sections is an array. Each item corresponds to a section and is an object with these properties:
- name: section name.
- start: zero-based position of the first character in the section inside the value of text.
- end: zero-based position of the first character after the section inside the value of text.
The corresponding input property is expected to be mapped to the workflow input or to the modelName.document.sections property of a model block that in turn received sections data.
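Because end is the position of the first character after the section, start and end behave like the bounds of a half-open range. The sketch below illustrates this with Python slicing. The function name section_texts and the sample text are hypothetical, and the sketch assumes that positions are counted in characters as Python counts them.

```python
def section_texts(text, sections):
    """Map each section name to the slice of text it delimits."""
    result = {}
    for section in sections:
        start, end = section["start"], section["end"]
        if not 0 <= start <= end <= len(text):
            raise ValueError(f"section {section['name']!r} is out of range")
        # end is exclusive, so it can be used directly as a slice bound.
        result[section["name"]] = text[start:end]
    return result


text = "Inaugural address\nWe shall pay any price."
sections = [
    {"name": "TITLE", "start": 0, "end": 17},
    {"name": "BODY", "start": 18, "end": 41},
]
print(section_texts(text, sections))
```

A check like this can help catch off-by-one errors in boundaries before they are passed to a model.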
sectionsText
The sectionsText key represents the text to be analyzed divided into sections, for example:
"sectionsText": [
  {
    "name": "TITLE",
    "text": "This is a title"
  },
  {
    "name": "BODY",
    "text": "This is the body"
  }
]
sectionsText is an array of objects. Each object has these properties:
- name: section name.
- text: section text.
The model builds the plain text to analyze by concatenating the values of the text properties of the array items, using a newline character as a separator.
If the text key is also set, the text obtained from sectionsText is appended to the value of text, again using a newline character as a separator, so the model receives the concatenation of the two texts. The model also receives automatically computed section boundaries that refer to the concatenated text.
For example:
- Value of text: We shall pay any price, bear any burden, meet any hardship, support any friend, oppose any foe to assure the survival and success of liberty.
- Value of sectionsText: [ { "name": "TITLE", "text": "President John F. Kennedy delivered his inaugural address" } ]
- Concatenated plain text (the two texts are separated by a newline character):
  We shall pay any price, bear any burden, meet any hardship, support any friend, oppose any foe to assure the survival and success of liberty.
  President John F. Kennedy delivered his inaugural address
- Section boundaries:
  - Section name: TITLE
  - Start: 142
  - End: 199
The corresponding input property is expected to be mapped to the workflow input or to the modelName.document.sections property of a model block that in turn received sections data.
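The concatenation and the boundary computation described in this section can be reproduced outside the model, for instance to predict the offsets a model will report. This is a minimal sketch of the documented behavior, not the model's actual code. The function name concatenate is hypothetical, and positions are assumed to be counted in characters as Python counts them.

```python
def concatenate(text, sections_text):
    """Append sectionsText items to text, newline-separated, and track boundaries."""
    pieces = [] if text is None else [text]
    # The first section starts right after text and its newline separator.
    offset = 0 if text is None else len(text) + 1
    sections = []
    for item in sections_text:
        start = offset
        end = start + len(item["text"])  # first character after the section
        sections.append({"name": item["name"], "start": start, "end": end})
        pieces.append(item["text"])
        offset = end + 1  # skip the newline separator
    return "\n".join(pieces), sections


text = (
    "We shall pay any price, bear any burden, meet any hardship, support any "
    "friend, oppose any foe to assure the survival and success of liberty."
)
sections_text = [
    {"name": "TITLE", "text": "President John F. Kennedy delivered his inaugural address"}
]
full_text, sections = concatenate(text, sections_text)
print(sections)  # [{'name': 'TITLE', 'start': 142, 'end': 199}]
```

The printed boundaries match the example above: text is 141 characters long, the newline separator occupies position 141, and the 57-character title spans positions 142 to 198.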
options
The options object contains optional parameters that can be passed to the model to influence its behavior.
allCategories
The allCategories property of the options object is a boolean. When its value is false, a categorization model returns only the categories with the highest scores.
custom
The custom property of the options object is an object that can be used to convey options to the JavaScript code inside a symbolic model.
Thesaurus models contain such code, which is automatically generated together with the model.
It is also possible to insert this kind of code in Studio projects, so that the code is then included in the project model.
The JavaScript code uses the getOptions method of the predefined CTX object to access the options.
scoreConfig
With the scoreConfig property of the custom object it is possible to customize the scoring algorithm of thesaurus models.
For example:
"options": {
  "custom": {
    "scoreConfig": {
      "disableScore": false,
      "defaultScore": 1,
      "normalize": 100,
      "boostByHierarchy": {
        "byParent": 1,
        "byChildren": 0.5,
        "byRelated": 0.3
      },
      "boostByFrequency": true,
      "boostByLabel": {
        "matchPrefLabel": 1,
        "matchAltLabel": 0.5,
        "lengthMeasure": 0.1,
        "ignoreCase": true
      }
    }
  }
}
These are the properties of the scoreConfig object:
- disableScore (boolean, default value false): if true, a Studio-like scoring algorithm is used. All the other options are ignored, so you can omit them.
- defaultScore (number, default value 1): default base score for all extractions. Ignored if boostByFrequency is true.
- normalize (number, default value 100): the final score of an extraction is normalized to a value in the range between 0 and the value of this parameter. Use value 0 to disable score normalization.
- boostByHierarchy: this property is an object whose properties are multiplication factors that are applied to the base score based on the relationship between the extracted concept and other concepts in the thesaurus.
  - byParent (number, default value 1): applied for every broader concept that is also extracted from the text. Value 0 is interpreted as no multiplication.
  - byChildren (number, default value 0.5): applied for every narrower concept that is also extracted from the text. Value 0 is interpreted as no multiplication.
  - byRelated (number, default value 0.3): applied for every non-hierarchically related concept that is also extracted from the text. Value 0 is interpreted as no multiplication.
- boostByFrequency (boolean, default value true): when true, the base score is the concept frequency in the text.
- boostByLabel: this property is an object whose properties determine how the base score is affected by the relationship between the extracted text and the concept labels.
  - matchPrefLabel (number, default value 1): multiplication factor applied to the base score if the matching text is the preferred label.
  - matchAltLabel (number, default value 0.5): multiplication factor applied to the base score if the matching text is one of the alternative labels.
  - lengthMeasure (number, default value 0.1): multiplication factor applied to the base score that is further multiplied by the number of tokens (separated by spaces) of the match.
  - ignoreCase (boolean, default value true): when true, case is ignored when matching the text against the labels of the concept.
You can omit the properties whose default value is fine for you. If all the default values are fine, you can omit scoreConfig, and if you don't need to specify other options, you can omit the options property altogether.
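Since every property with an acceptable default can be left out, a client can send only the settings that differ from the defaults. The sketch below shows one way to do that. The names DEFAULT_SCORE_CONFIG, prune_defaults and build_options are hypothetical, the default values are the ones listed above, and the property names follow the JSON example.

```python
# Documented defaults of scoreConfig, keyed as in the JSON example.
DEFAULT_SCORE_CONFIG = {
    "disableScore": False,
    "defaultScore": 1,
    "normalize": 100,
    "boostByHierarchy": {"byParent": 1, "byChildren": 0.5, "byRelated": 0.3},
    "boostByFrequency": True,
    "boostByLabel": {
        "matchPrefLabel": 1,
        "matchAltLabel": 0.5,
        "lengthMeasure": 0.1,
        "ignoreCase": True,
    },
}


def prune_defaults(config, defaults=DEFAULT_SCORE_CONFIG):
    """Return only the entries of config that differ from the defaults."""
    pruned = {}
    for key, value in config.items():
        if key not in defaults:
            raise KeyError(f"unknown scoreConfig property: {key}")
        if isinstance(value, dict):
            nested = prune_defaults(value, defaults[key])
            if nested:
                pruned[key] = nested
        elif value != defaults[key]:
            pruned[key] = value
    return pruned


def build_options(score_config):
    """Wrap non-default settings in options.custom.scoreConfig, or return {}."""
    pruned = prune_defaults(score_config)
    return {"options": {"custom": {"scoreConfig": pruned}}} if pruned else {}


print(build_options({"normalize": 0, "boostByHierarchy": {"byParent": 1, "byChildren": 1}}))
```

In this usage example only normalize and byChildren survive the pruning, because byParent already has its default value. Passing a configuration made entirely of defaults yields an empty dictionary, which corresponds to omitting the options property.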
normalizeToConceptId
The normalizeToConceptId property of the custom object is a boolean that, when true, makes a thesaurus model add additional thesaurus information for the extracted concepts to its output.
For example:
"options": {
  "custom": {
    "normalizeToConceptId": true
  }
}