NL Core
NL Core is a software module which performs Natural Language Understanding (NLU) analysis of an input text in a given language to produce symbolic information (morphological, lexical, semantic, syntactic, etc.) about text tokens.
NL Core is the NLU engine of basic mode ML model blocks, where it performs feature extraction; the prediction model is then fed with the resulting text features.
When provided with human-readable symbolic rules (rules that use the symbolic information identified by NLU analysis as their operands), NL Core can also predict categories or extractions by itself. Each symbolic model is in fact NL Core equipped with rules that were either generated automatically in the Platform authoring application when the model was trained or written by hand with Studio.
NL Core also has JavaScript code execution capabilities: JavaScript can be used to affect the document analysis pipeline, and Studio lets you write this code.
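For illustration only, the snippet below sketches the kind of hooks such code typically implements; the hook names (onBeforeAnalysis, onAfterAnalysis) and the object shapes are hypothetical stand-ins used in this description, not the actual Studio scripting API.

```javascript
// Hypothetical sketch: hook names and object shapes are illustrative
// stand-ins, not the actual Studio scripting API.

// Pre-processing hook: normalize the raw text before NL Core analyzes it.
function onBeforeAnalysis(text) {
  // Collapse repeated whitespace and strip a boilerplate footer.
  return text
    .replace(/\s+/g, " ")
    .replace(/--- end of document ---\s*$/i, "")
    .trim();
}

// Post-processing hook: adjust the symbolic results before they are returned.
function onAfterAnalysis(result) {
  // Drop low-scoring categories and attach a piece of extra data.
  result.categories = result.categories.filter((c) => c.score >= 10);
  result.extraData = { categoryCount: result.categories.length };
  return result;
}

// Stand-alone demonstration with made-up data.
console.log(onBeforeAnalysis("Contract   between ACME and   Globex --- End of document ---"));
console.log(onAfterAnalysis({
  categories: [{ label: "Contracts", score: 42 }, { label: "Sports", score: 3 }]
}));
```

Typical uses, described in more detail under Studio below, are pre-processing the input text, post-processing the results and producing extra data.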
CPK
CPK stands for (NL) Core Package. Symbolic models and CPKs are essentially the same thing: NL Core plus symbolic rules and, if the model was generated with Studio, any JavaScript code.
When a CPK is produced in the Platform authoring application as the result of a Bootstrapped Studio Project, Explainable Categorization, Explainable Extraction or Thesaurus generation experiment, symbolic rules are automatically generated during model training and are all of the same kind, either categorization or extraction, based on the project type. This means that a Platform generated symbolic model can predict categories or extractions, but not both.
The CPK interchange file format, for export and import, has the .cpk file name extension.
Studio
Studio can be used to edit and enrich Platform generated CPKs or to create models from scratch. Modified and new models can then be used in Platform to run experiments with document libraries in the context of an authoring project and/or as components of NL Flow workflows.
CPK features that can only be implemented with Studio are:
- Full use of the rules language: the language is very rich and makes it possible to handle the most varied situations by customizing the model very thoroughly. By contrast, the vocabulary that can be used when rules are generated automatically during a Platform experiment is relatively small.
- Both categorization and extraction rules: the model can contain rules of both kinds and so predict categories and extractions at the same time. A model of this type:
- When used for experiments in the authoring application, will give only the results corresponding to the project type: categories if imported in a categorization project, extractions if imported in an extraction or thesaurus project.
- When used in an NL Flow workflow, will return both categories and extractions inside the same output JSON.
- Optional segmentation rules: they detect and output text segments, that is, parts of the document with distinctive features, like the list of the parties in a contract.
Segments are useful output by themselves, but since NL Core evaluates segmentation rules before categorization and extraction rules, segments can also be used as the scope of other rules. This allows, for example, predicting an extraction only if the corresponding text occurs in a given segment.
CPK segments can also be used for sub-document categorization in basic mode and advanced mode ML models.
- JavaScript code: Studio allows writing JavaScript code to be executed in the key phases of the document analysis pipeline. Examples of what can be achieved with scripting are:
- Pre-processing the input text before analyzing it
- Post-processing the results
- Producing extra data
- Extract Converter output management: any model using NL Core can accept as its input document layout information, produced by an Extract Converter processor, instead of plain text. However, while Platform generated models can "see" only the text inside this information, writing rules by hand in Studio makes it possible to exploit the layout information with special language constructs, for example to extract information only from the cells of a table or to predict a category only if a given text occurs inside a heading.
- Custom options: the behavior of any CPK can be influenced in various ways by the functional parameters of the corresponding workflow block and by standard options that can be added to the input JSON. Custom options can also be specified in the input JSON, but only the JavaScript code inside thesaurus models and any code added with Studio can read them and use their values to affect the behavior of the CPK, as sketched below.
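As a minimal illustration of the custom options point, the fragment below assumes a custom option travels in the input JSON next to the text and is read by model-side JavaScript; the field and option names (options, minConfidence, includeSegments) and the applyCustomOptions helper are hypothetical, chosen only to show the pattern, not part of the actual input schema or Studio API.

```javascript
// Hypothetical sketch: field and option names are illustrative only.
// An input JSON might carry custom options next to the text, e.g.:
const inputJson = {
  text: "The parties agree to the terms below...",
  options: {
    minConfidence: 20,      // custom option, ignored unless model code reads it
    includeSegments: true   // another custom option
  }
};

// Studio-authored JavaScript could read those options and use them to
// filter or reshape the CPK output.
function applyCustomOptions(result, options) {
  const minConfidence = options.minConfidence ?? 0;
  return {
    // Both categorization and extraction results can travel in the same output.
    categories: result.categories.filter((c) => c.score >= minConfidence),
    extractions: result.extractions,
    segments: options.includeSegments ? result.segments : undefined
  };
}

// Stand-alone demonstration with made-up results.
const demo = applyCustomOptions(
  {
    categories: [{ label: "Contracts", score: 55 }, { label: "Misc", score: 5 }],
    extractions: [{ field: "PARTY", value: "ACME Corp" }],
    segments: [{ name: "PARTIES", text: "between ACME Corp and Globex Ltd" }]
  },
  inputJson.options
);
console.log(JSON.stringify(demo, null, 2));
```

The same idea applies to any behavior the model is meant to make configurable: a custom option has no effect unless some Studio-authored code explicitly reads it.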