NL Core and CPKs
NL Core
What is it and what does it do?
NL Core is a complex document analysis module that plays a fundamental role in Platform, both in the authoring application and in the model blocks of workflows.
Its main function accepts either:
- Plain text
- Plain text plus section boundaries
- Sectioned text
- In the case of text taken from a PDF document, text plus page layout information
plus:
- Implicit or explicit functional properties
- Possible custom options
as its input, and returns extensive information about the text.
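Purely as an illustration of the kind of input just listed, here is a sketch of how it could be represented; the field names are hypothetical and do not reflect the actual NL Core interface:

```javascript
// Hypothetical sketch only: the field names are illustrative, not the real NL Core input format.
const input = {
  text: "Mary reads a book in Rome.",              // plain text (or sectioned text)
  sections: [{ name: "body", start: 0, end: 26 }], // optional section boundaries
  layout: null,                                    // page layout information, for text taken from a PDF
  properties: { documentType: "standard" },        // implicit or explicit functional properties
  options: {}                                      // possible custom options
};
```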
An NL Core instance for a given language:

1) (Always) Carries out Natural Language Understanding (NLU) of the input text, returning numerous features (see the sketch after this list):
   - Of the entire document
   - Of individual text tokens and sub-tokens (atoms):
     - Lexical, morphological and semantic information about text tokens and atoms
     - Named entities with open data
   - Syntactic and semantic relationships:
     - Syntactic dependencies between text tokens
     - Verb-centered relations between tokens
2) (In the case of the text of PDF documents uploaded with the PDF document view option or coming from Extract in an NL Flow workflow) Provides layout, typographical and geometric information on the text as it is displayed in the source PDF.
3) (If accompanied by symbolic, human-readable rules of one or more of these types: tagging, segmentation, categorization, information extraction) Compares the conditions of the symbolic rules, made up of operands that refer to particular features of the text, with the actual features of the text extracted as in the previous points.
   When a match is found, rules are triggered and contribute to superimposing tags on portions of the text, identifying segments of text with peculiar characteristics and predicting categories and/or extractions.
   Rules can be either generated automatically as a result of model training or handwritten when models are created or manipulated with Studio.
4) (If accompanied by JavaScript event handling functions) Uses the information referred to in the previous points (features, tags, segments and predictions) in any useful way, either to post-process the output (for example, computing the score of thesaurus models' extractions with a non-default algorithm) or to enrich it with derived extra data.
   JavaScript code can be either generated automatically and put inside the model at the end of the training phase or handwritten when models are created or manipulated with Studio.
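Purely as an illustration of the NLU features mentioned in point 1, the output for a short text could be pictured as follows; the property names are hypothetical and do not reflect the actual NL Core output schema:

```javascript
// Illustrative only: the property names do not mirror the real NL Core output.
const analysis = {
  language: "en",                                   // document-level information
  tokens: [                                         // lexical, morphological and semantic information
    { text: "reads", lemma: "read", partOfSpeech: "VERB", atoms: ["reads"] }
  ],
  entities: [                                       // named entities with open data
    { text: "Rome", type: "GEO", openData: ["https://dbpedia.org/resource/Rome"] }
  ],
  dependencies: [                                   // syntactic dependencies between tokens
    { dependent: "Mary", head: "reads", relation: "subject" }
  ],
  relations: [                                      // verb-centered relations between tokens
    { verb: "read", subject: "Mary", object: "book" }
  ]
};
```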
How does it work and what is it made of?
The algorithm performing NLU (see point 1 in the previous paragraph) is based on a knowledge graph, which can be the factory one for the given language or the result of a customization project. The recognition of named entities and their types is an exception: for entities not present in the knowledge graph, it is based on heuristics.
Any tagging, segmentation, categorization and extraction rules are evaluated by an algorithm which compares their conditions with the information found by the NLU algorithm or produced by previously evaluated rules: if the condition of a rule matches the characteristics of a portion of the text, the rule triggers and superimposes a tag on a portion of text, identifies a segment of text with peculiar characteristics, gives points to a candidate category or extracts instances of a class of information.
All rules of the same type are evaluated simultaneously: first the tagging rules, then the segmentation rules and finally, together, the categorization and extraction rules.
This allows segmentation rules to leverage tags, and categorization or extraction rules to also match tags and segments identified by the rules of the previous types.
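To make the evaluation order concrete, here is a conceptual sketch of the cascade; it is not the actual NL Core implementation and all names are illustrative:

```javascript
// Conceptual sketch of the rule cascade described above; not the real NL Core code.
function evaluateRules(features, rules) {
  // 1. Tagging rules see only the NLU features.
  const tags = rules.tagging
    .filter(rule => rule.matches({ features }))
    .map(rule => rule.apply());

  // 2. Segmentation rules can also leverage the tags.
  const segments = rules.segmentation
    .filter(rule => rule.matches({ features, tags }))
    .map(rule => rule.apply());

  // 3. Categorization and extraction rules run together and can match
  //    features, tags and segments identified by the previous rules.
  const context = { features, tags, segments };
  const categories = rules.categorization
    .filter(rule => rule.matches(context))
    .map(rule => rule.apply());
  const extractions = rules.extraction
    .filter(rule => rule.matches(context))
    .map(rule => rule.apply());

  return { tags, segments, categories, extractions };
}
```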
Rules can exploit layout information: for example, it is possible to have rules that are triggered only if text with certain features is found in a title or in a table cell.
Any JavaScript is written in the functions that handle the main events of the input processing pipeline. For example, the onCategorizer function fires after the categorization rules are evaluated. Its code can post-process the results of that phase, also using convenient predefined objects that NL Core makes available.
Another function, onFinalize, is triggered when the output is finalized and can be used to manipulate it in any way and also to produce additional, extra output.
In the main script it is possible to import JavaScript modules and use their functions.
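A minimal sketch of what such handlers might look like is shown below. The names onCategorizer and onFinalize are the ones mentioned above; the result object, its properties and the imported module are hypothetical placeholders, not the actual NL Core scripting API:

```javascript
// The imported module is hypothetical; it only illustrates that modules can be used.
import { normalizeScore } from "./scoring.js";

// Fires after the categorization rules have been evaluated:
// here it post-processes the scores of the predicted categories.
function onCategorizer(result) { // "result" is a placeholder, not the real API
  for (const category of result.categories) {
    category.score = normalizeScore(category.score);
  }
}

// Fires when the output is being finalized:
// here it enriches the output with derived extra data.
function onFinalize(result) {
  result.extraData = { categoryCount: result.categories.length };
}
```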
The components of an NL Core instance are therefore:
- In all cases, the knowledge graph, other ancillary data and the code libraries.
- Any symbolic rules plus the corresponding resources, such as the category tree in the case of categorization models or the definition of information classes in extraction models.
Note
A model created with Studio can both categorize and extract, while an explainable AI model generated with the Platform's authoring application only performs the task corresponding to the project type.
- Any JavaScript code, framed in event handling functions and possibly complemented by modules used as code libraries.
Where is it used?
Instances of NL Core are used in all the fundamental operations of the Platform's authoring application and inside model blocks in NL Flow workflows.
Basic instances of NL Core (basic because they contain no rules or script) for all the supported languages constitute the tech versions, so they are used whenever the authoring application resorts to a tech version for project activities.
When an NL Core instance of a tech version is used to train a model during an experiment, it is then embedded in the model itself.
In an ML model, it is an exact copy of the tech version's instance, so it doesn't have any rules or script. It becomes the part of the model responsible for extracting text features, the other part being the ML algorithm that bases its predictions upon those features.
Explainable AI models, instead, are copies of the tech version's NL Core instance enriched with the rules, the resources and any script automatically generated during training.
So, a symbolic model is an enriched, specialized instance of NL Core, while an ML model contains a basic NL Core instance.
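Conceptually, the two-part structure of an ML model can be sketched like this; the names are illustrative and this is not actual Platform code:

```javascript
// Illustrative only: an ML model = a basic NL Core instance (feature extraction)
// plus an ML algorithm that predicts from those features.
function mlModelPredict(text, nlCoreInstance, mlAlgorithm) {
  const features = nlCoreInstance.extractFeatures(text); // basic NL Core: no rules, no script
  return mlAlgorithm.predict(features);                  // predictions based on those features
}
```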
Info
In NL Flow workflows, advanced-mode ML model blocks don't use their NL Core instance because they rely on another, upstream model for feature extraction.
CPKs
A CPK is an NL Core package and has two forms:
1) When it's the result of exporting an explainable model from the authoring application or deploying a Studio project, it is an archive file with the .cpk extension.
2) When loaded in Platform, it coincides with a symbolic model or a tech version's instance of NL Core.
Please note that ML models also use NL Core, so the CPK is part of the .mlpk archive file produced when exporting an ML model.
So, any model either contains a CPK or is a CPK. However, when an ML model is used in a workflow in advanced mode, its CPK is not used.