
The languages

Welcome to the languages reference guide. These languages are the programming languages used to write the "source code" for a unique type of software: text intelligence engines.

Text intelligence engines can be programmed to perform automatic categorization, automatic extraction or both.

There are two languages:

  1. rules
  2. scripting

The core of text intelligence engines is written in the rules language, while scripting is an optional language to be used to control and extend the process workflow.

The rules language is a declarative language because it doesn't implement algorithms in explicit steps. Essentially, the source code is a set of categorization and/or extraction rules.


Categorization is the process that strives to determine what a document/text [1] is about.

It is the activity that a human brain or a computer program executes when a decision needs to be made or when a task needs to be performed based on the type of contents within a document/text or on the topics covered by the document/text itself.

For example, in order to give a proper answer to a customer's request, the request must first be categorized to determine what type of request it is.

Text intelligence engines that are designed to automatically categorize input texts are programmed to recognize all the document types or topics of interest. These topics or types are called domains. The entire set of domains the engine is able to recognize is called the project taxonomy.

The core of the categorization activity is to compare all the categorization rules with the text. Each rule has an associated domain: if the comparison between the rule and the text has a positive outcome, the score of the domain is increased. At the end of the process, the domain (or domains) with the highest cumulative score constitutes the categorization outcome.
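The scoring process described above can be illustrated with a minimal Python sketch. It is not written in the rules language and does not reflect how the engine is implemented: the rule format (domain, keyword, points), the domain names and the `categorize` function are all assumptions made for illustration, and a simple keyword test stands in for the real comparison between a rule and the text.

```python
from collections import defaultdict

# Hypothetical rules: (domain, keyword to look for, points to add).
CATEGORIZATION_RULES = [
    ("billing", "invoice", 10),
    ("billing", "refund", 10),
    ("support", "password", 10),
    ("support", "refund", 5),
]

def categorize(text, rules=CATEGORIZATION_RULES):
    """Return (winning domains, all domain scores) for the given text."""
    words = set(text.lower().split())
    scores = defaultdict(int)
    for domain, keyword, points in rules:
        if keyword in words:          # positive comparison outcome
            scores[domain] += points  # increase the score of the rule's domain
    if not scores:
        return [], {}
    best = max(scores.values())
    # More than one domain can share the highest cumulative score.
    winners = sorted(d for d, s in scores.items() if s == best)
    return winners, dict(scores)

winners, scores = categorize("Please send the invoice and process my refund")
```

Because every rule only adds points to its own domain, the result does not depend on the order in which the rules are listed.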

An example of categorization by comparison is the recognition of military aircraft during the Second World War. To determine whether the planes were friends or foes, military personnel had playing cards, tables and posters showing the silhouettes of military planes. Comparing the shapes in their cards with those of the airplanes flying over them, they could "categorize" airplanes as either "good" or "bad" and, in the latter case, raise the alarm.

Spotter cards (U.S. Air Force photo by Ken LaRock, VIRIN: 170921-F-IO108-001.JPG)


Extraction is the process that strives to retrieve useful data from a document/text.

It is the activity that a human brain or a computer program executes whenever it is known or suspected that a document contains the data needed to make a decision or perform a task.

For example, to correctly associate a customer's request with the customer's record, some identifying data, such as a code, first name, last name or birth date, must be extracted from the request.

As with categorization, the core of the extraction activity is to compare all the extraction rules with the text. Each rule has an associated data template: if the comparison between the rule and the text has a positive outcome, an instance of the data template - a record - is filled with the matching text or with a normalized representation of it. At the end of the process, the records constitute the extraction outcome.
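As a rough illustration of the extraction process described above, the following Python sketch fills records from matching text. It is not the rules language: the `CUSTOMER` template, its field names, the regular expressions and the day/month/year date convention are all hypothetical, and regular expressions merely stand in for the real rule conditions.

```python
import re

def _code(match):
    # Normalized representation: customer codes are stored in upper case.
    return match.group(1).upper()

def _iso_date(match):
    # Assumes the text uses day/month/year; normalizes to year-month-day.
    day, month, year = match.groups()
    return f"{year}-{month}-{day}"

# Hypothetical rules: (data template, field, pattern, normalizer).
EXTRACTION_RULES = [
    ("CUSTOMER", "code",
     re.compile(r"customer code:?\s+([a-z]\d{4})\b", re.IGNORECASE), _code),
    ("CUSTOMER", "birth_date",
     re.compile(r"born on (\d{2})/(\d{2})/(\d{4})"), _iso_date),
]

def extract(text, rules=EXTRACTION_RULES):
    """Return one record per rule match found in the text."""
    records = []
    for template, field, pattern, normalize in rules:
        for match in pattern.finditer(text):
            records.append(
                {"template": template, "field": field, "value": normalize(match)}
            )
    return records

records = extract("My customer code: a1234, I was born on 05/11/1980.")
```

Each record carries the name of its template and field together with the normalized value, so downstream code can associate the request with the right customer record.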


A rule is a combination of a condition and an action.

The action is performed whenever the condition is met: in categorization this means "increase the score of domain X by N points", while in extraction it means "fill data template Y with token data".

A condition can be equated to a rigid or elastic shape. The disambiguator - the text analysis module at the heart of the technology - transforms the input text into a sequence of tokens, each enriched with all the attributes that the disambiguator has identified during its analysis. The text intelligence engine takes each defined rule and "superimposes" its condition (the aforementioned "shape") on the token stream. Whenever the "shape" perfectly fits a portion of the stream or, less commonly, the entire stream, the condition is satisfied and the rule's action is performed. It can be said that the input tokens trigger or activate the rule.

The condition of a rule can also be seen as a "text template" or "text model": if the template/model matches the input text, then the rule is activated.
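The idea of superimposing a "shape" on the token stream can be sketched in Python as follows. This is an assumption-laden illustration, not the engine's algorithm: the `Token` attributes, the slot predicates and the helper names are hypothetical, and only a rigid shape (a fixed number of adjacent tokens) is covered.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    text: str
    lemma: str
    pos: str  # part of speech, e.g. "NOUN" or "VERB"

def find_matches(condition, tokens):
    """Slide the condition over the stream; return the start indexes where every slot fits."""
    width = len(condition)
    return [
        start
        for start in range(len(tokens) - width + 1)
        if all(slot(tok) for slot, tok in zip(condition, tokens[start:start + width]))
    ]

def apply_rule(condition, action, tokens):
    """Perform the rule's action once for every portion of the stream the condition fits."""
    for start in find_matches(condition, tokens):
        action(tokens[start:start + len(condition)])

# Hypothetical condition: the verb "open" immediately followed by any noun.
condition = [
    lambda t: t.lemma == "open" and t.pos == "VERB",
    lambda t: t.pos == "NOUN",
]

tokens = [
    Token("She", "she", "PRON"),
    Token("opened", "open", "VERB"),
    Token("accounts", "account", "NOUN"),
    Token("yesterday", "yesterday", "ADV"),
]

scores = {}

def add_points(span):
    # Categorization-style action: add 10 points to a hypothetical "banking" domain.
    scores["banking"] = scores.get("banking", 0) + 10

apply_rule(condition, add_points, tokens)
```

An elastic shape would additionally allow a variable number of tokens between slots; that case is left out to keep the sketch short.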

There is no explicit order of evaluation of the rules; it is as if all defined rules are evaluated simultaneously. The only exceptions to this principle occur when the IMPORT statement, segments or sub-rules are used. In these cases the order of the rules within the source code becomes relevant (please see the dedicated topics for more detail).

A text intelligence engine's source code can be considered as multiple (and possibly overlapping) shapes which capture the possible clues regarding topics or data of interest.

Like the World War II "spotter cards" mentioned above or swatches used to match paint color, the larger the assortment, the greater the ability to recognize interesting cases. It is important, however, to properly size the assortment so that time and money are not wasted on "silhouettes" or "colors" (shapes, rules) which are irrelevant to the needs of the project.

Language features

There are language constructs which are specific to the categorization activity and there are those which are specific to extraction activity. However, the rules language also has many features which are common to both activities.
This guide has been divided into sections according to these commonalities and peculiarities.

  1. Throughout this guide the terms "document" and "text" are used interchangeably because "document" is intended to mean "the document's text".

    In light of this interpretation, a document can be:

    • Any plain text file.
    • The text of a "textual" PDF file.
    • The text taken from an image like a scanned document or from a "visual" PDF file via OCR.
    • The caption of a photograph.
    • The transcript of a phone call.
    • A line from a chat.
    • The text of an e-mail message.
    • A "tweet".
    • A post in a forum or a comment to a post.
    • The value of a field in a screen form.

    ... in summary: any string of characters from any source.