Tech version and NLU analysis
In the Platform authoring application, the Tech version underlying any project is a collection of NL Core software services. There is at least one instance of NL Core for each language supported by the Platform installation, and there may be more instances per language if the application owner decided at installation time that a given language required greater processing capacity.
Each instance of NL Core provides two essential capabilities for its language:
- The basic knowledge graph.
- A Natural Language Understanding (NLU) analysis function.
The NLU analysis is based on the knowledge graph and is performed automatically on the text of each document during the upload procedure or after changing a document's language.
It extracts various types of information (lexical, linguistic, semantic, thematic) from the text, which are then used to:
- Index documents so that users can search and drill down to easily explore libraries and corpora and make discoveries about documents.
- In extraction projects, enable automatic propagation of annotations and active learning.
- In thesaurus projects, make it possible to suggest broader, narrower or related labels and concepts.
- Train models (analysis output information represents the features of the documents) during experiments.
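As an illustration of the indexing use case, the following is a minimal sketch of how analysis output could back faceted search and drill-down. It is not the Platform's implementation: the `analyses` dictionary, the facet names, and the functions `build_facet_index` and `drill_down` are all hypothetical and only show the general idea of an inverted index keyed by facet values.

```python
from collections import defaultdict

def build_facet_index(analyses):
    """Map each (facet, value) pair to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, facets in analyses.items():
        for facet, values in facets.items():
            for value in values:
                index[(facet, value)].add(doc_id)
    return index

def drill_down(index, selections):
    """Return the ids of the documents that match every selected (facet, value)."""
    doc_sets = [index.get(selection, set()) for selection in selections]
    if not doc_sets:
        return set()
    return set.intersection(*doc_sets)

# Hypothetical analysis output for three documents (invented sample data).
analyses = {
    "doc1": {"lemma": ["go", "scarf"], "topic": ["fashion"]},
    "doc2": {"lemma": ["go"], "topic": ["travel"]},
    "doc3": {"lemma": ["scarf", "cheap"], "topic": ["fashion"]},
}

index = build_facet_index(analyses)
# Drill down: first restrict to a topic, then to a lemma within that topic.
matches = drill_down(index, [("topic", "fashion"), ("lemma", "scarf")])
print(sorted(matches))
```

Each additional selection narrows the result set, which is the behavior a user sees when drilling down through facets.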
Note
An explainable AI model is called CPK because the file to which it can be exported and from which it can be imported has the .cpk extension, which stands for "NL Core package". In addition to the categorization or extraction rules, the file also contains all the NL Core software files for the model language, including the knowledge graph.
The following types of information extracted by the NLU analysis are exposed to users, who can use them as facets for faceted search and drill-down of documents:
- Named entities
- Main Phrases: the phrases that the NLU analysis deemed most representative of the entire text.
- Keywords: literal values of text tokens, exactly as written.
- Lemmas: all common nouns, proper nouns, adjectives and verbs. The lemma, or base form, is displayed, that is, the form used as the name of an entry in a vocabulary or an encyclopedia, so the list contains only one item for all the inflected forms of the same lemma. For example, the lemma go represents the words go, goes, went, etc., and the lemma scarf represents scarf and scarves.
- Syncons: "syncon" is expert.ai terminology for a node in the knowledge graph, which corresponds to a concept, or for a concept heuristically recognized as a "type of" a concept modeled in the knowledge graph. Every syncon is represented by a lemma. For a knowledge graph node, it's the lemma most commonly used to express the concept. For example, if the text contains the word inexpensive, the item in the list is cheap, because cheap is the synonym of inexpensive (the words have the same meaning and express the same concept) that, statistically, is used most often to express the concept.
- Main Lemmas: the lemmas (see above) that the NLU analysis deemed most representative of the whole text.
- Main Syncon Labels: compact descriptions of the main concepts based on their ancestry in the IS-A hierarchy.
- Main Topics: inside the knowledge graph, syncons can be associated with one or more topics. The NLU analysis determines the concepts expressed in the text, derives the topics from the corresponding syncons, and then chooses the main topics based on their frequency and association with the main concepts.
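The relationship between words, lemmas, syncons and topics described above can be sketched with a toy example. Everything here is an assumption made for illustration: the lookup tables `LEMMA_OF`, `SYNCON_LABEL` and `TOPICS_OF` are tiny hand-written stand-ins for the knowledge graph, the topic names are invented, and the ranking uses frequency only, whereas the actual analysis also takes the association with the main concepts into account.

```python
from collections import Counter

# Hypothetical stand-ins for knowledge graph data (not real NL Core content).
LEMMA_OF = {
    "go": "go", "goes": "go", "went": "go",
    "scarf": "scarf", "scarves": "scarf",
    "inexpensive": "inexpensive", "cheap": "cheap",
}
# Lemma -> lemma most often used to express the same concept (syncon label).
SYNCON_LABEL = {"inexpensive": "cheap", "cheap": "cheap", "scarf": "scarf", "go": "go"}
# Syncon label -> topics associated with that syncon (invented topic names).
TOPICS_OF = {"cheap": ["economy"], "scarf": ["fashion"], "go": ["travel"]}

def lemma_facet(tokens):
    """One item per lemma, regardless of how many inflected forms occur."""
    return sorted({LEMMA_OF[t.lower()] for t in tokens if t.lower() in LEMMA_OF})

def syncon_facet(tokens):
    """One item per concept, shown with its representative lemma."""
    return sorted({SYNCON_LABEL[lemma] for lemma in lemma_facet(tokens)})

def main_topics(tokens, top_n=2):
    """Rank topics by how often their syncons occur (frequency only)."""
    counts = Counter()
    for token in tokens:
        lemma = LEMMA_OF.get(token.lower())
        if lemma is None:
            continue
        for topic in TOPICS_OF.get(SYNCON_LABEL[lemma], []):
            counts[topic] += 1
    return [topic for topic, _ in counts.most_common(top_n)]

tokens = ["She", "went", "out", "wearing", "inexpensive", "scarves",
          "and", "a", "cheap", "scarf"]
print(lemma_facet(tokens))   # "went" is listed under "go", "scarves" under "scarf"
print(syncon_facet(tokens))  # "inexpensive" is listed under "cheap"
print(main_topics(tokens))
```

Note that the lemma facet still lists inexpensive and cheap separately, while the syncon facet merges them into a single item, mirroring the difference between the two facets described in the list.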