Welcome to the expert.ai languages reference guide.
expert.ai languages are the programming languages used to write the "source code" for a unique type of software: text intelligence engines.
Text intelligence engines can be programmed to perform automatic categorization, automatic extraction or both.
There are two expert.ai languages:
- rules language
- scripting language
The core of text intelligence engines is written in the rules language, while scripting is optional and is used to control and extend the process workflow.
The rules language is a declarative language because it doesn't implement algorithms in explicit steps. Essentially, the source code is a set of rules.
Document classification is the activity that a human mind or a computer program performs whenever a decision needs to be taken or a task needs to be executed based on the document type or on the topics covered by the document.
For example, in order to give a proper answer to a customer's request, the type of request must be determined.
Text intelligence engines classify documents with a process called categorization, which strives to determine what a document/text is about. The possible categories are called domains and the entire set of domains the engine is able to recognize is called a taxonomy.
The core of the categorization process is the comparison of categorization rules with the text. Each rule is made of a condition and a domain: if the text satisfies the rule's condition, the domain receives a certain number of scoring points.
At the end of the process, the domains that received points constitute the categorization outcome.
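While the rule syntax is described in detail in the rest of this guide, a first, hedged sketch can give an idea of what a categorization rule looks like. In the fragment below the SPORT domain name, the score and the keywords are illustrative assumptions, not taken from any real taxonomy:

```
SCOPE SENTENCE
{
    DOMAIN(SPORT:10)
    {
        AND
        {
            KEYWORD("football"),
            KEYWORD("match")
        }
    }
}
```

Read it as: whenever a sentence contains both "football" and "match", the SPORT domain receives 10 scoring points.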
An example of categorization by comparison is the recognition of military aircraft during the Second World War. To determine whether the planes were friends or foes, military personnel had playing cards, tables and posters showing the silhouettes of military planes. Comparing the shapes in their cards with those of the airplanes flying over them, they could "categorize" airplanes as either "good" or "bad" and, in the latter case, raise the alarm.
U.S. Air Force photo by Ken LaRock, VIRIN: 170921-F-IO108-001.JPG
Information extraction is the activity that a human mind or a computer program performs to detect needed data in a document.
For example, to correctly associate a customer's request with the customer's record, customer identification data like first name, last name or customer code must be extracted from the request.
Text intelligence engines perform information extraction to retrieve useful data from a document/text by comparing extraction rules with the text.
Like categorization rules, extraction rules contain a condition, but instead of a domain they are associated with a data template: if the text satisfies a rule's condition, the text that matches the condition, as a whole or in parts, is transferred (verbatim or after a value normalization) into template fields that together constitute an extraction record.
At the end of the process, records constitute the extraction outcome.
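Deferring the exact syntax to the extraction-specific sections of this guide, a hedged sketch of an extraction rule could look like the fragment below, where the CUSTOMER template, the @CODE field and the customer-code pattern are illustrative assumptions:

```
SCOPE SENTENCE
{
    IDENTIFY(CUSTOMER)
    {
        KEYWORD("customer"),
        @CODE[PATTERN("[A-Z]{2}[0-9]{6}")]
    }
}
```

If a sentence contains the word "customer" together with a token matching the code pattern, the matching code fills the CODE field of a CUSTOMER extraction record.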
As anticipated above, a rule is a combination of a condition and an action.
The action is performed whenever the condition is met. In categorization the action is "increase the score of the associated domain by N points", while in extraction it is "fill the data template fields with the text matched by the condition or by its sub-conditions".
A condition can be equated to a rigid or elastic shape. The disambiguator—the text analysis module at the heart of expert.ai technology—transforms the input text into a sequence of tokens, each enriched with all the attributes that the disambiguator has identified during its analyses. The text intelligence engine takes each defined rule and "superimposes" its condition (the aforementioned "shape") on the token stream. Whenever the "shape" fits a portion of the stream perfectly or, less commonly, the entire stream, the condition is satisfied and the rule's action is performed. It can be said that the input tokens trigger or activate the rule.
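The token attributes produced by the disambiguator are what make a "shape" elastic rather than rigid. As a hedged illustration (the AVIATION domain name and score are assumptions, and the constructs anticipate later chapters), a condition can match a token's base form instead of its literal spelling:

```
SCOPE SENTENCE
{
    DOMAIN(AVIATION:5)
    {
        // A lemma-based condition matches any inflected form
        // of the base word: "fly", "flies", "flew", "flown", "flying"
        LEMMA("fly")
    }
}
```

The same elastic fit applies to the other attributes the disambiguator assigns to tokens, such as their part of speech or meaning.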
A text intelligence engine's source code can thus be considered a collection of possibly overlapping shapes which capture topic clues or interesting data.
Like the World War II "spotter cards" mentioned above, or swatches used to match paint color, the larger the assortment, the greater the ability to recognize interesting cases. It is important, however, to properly size the assortment so that time and money are not wasted on "silhouettes" or "colors" (shapes, rules) which are not pertinent to the needs of the project.
The condition of a rule can also be seen as a "text template" or "text model": if the template/model matches the input text, then the rule is activated.
In general, there is no explicit order of evaluation of the rules of the same type; it is as if all defined rules get evaluated simultaneously. The only exceptions to this principle are sub-rules, which, if defined, are necessarily evaluated before rules.
Other types of rules and the analysis pipeline
A programmer can also define tagging and segmentation rules. Rules of these types do not categorize or extract, but their results—tags and segments—can be exploited in categorization and extraction rules.
The rules evaluation order is:
- Tagging rules
- Segmentation rules
- Categorization & extraction rules
Rules evaluation is the core of the analysis pipeline (see the picture below) that the text intelligence engine executes for every input document.
Rules language features and this book
There are language constructs which are specific to the categorization activity and others which are specific to the extraction activity. However, the rules language also has many features which are common to both activities.
This book has been divided into sections according to these commonalities and peculiarities. Segmentation rules are dealt with together with the common features, while a specific section is dedicated to tagging rules.
A script allows the programmer to control and extend the text intelligence engine document analysis pipeline, because it can perform powerful actions after crucial steps of the process, from the preparation of input text to the finalization of results.
Scripting support is being developed and will be fully available in future releases of Studio. Find more information on how it will work in the dedicated article of this book.
Throughout this guide the terms "document" and "text" are used interchangeably because a document is intended as "the document's text".
In light of this interpretation, a document can be:
- Any plain text file.
- The text of a "textual" PDF file.
- The text taken from an image like a scanned document or from a "visual" PDF file via OCR.
- The caption of a photograph.
- The transcript of a phone call.
- A line from a chat.
- The text of an e-mail message.
- A "tweet".
- A post in a forum or a comment to a post.
- The value of a field in a screen form.

In summary: any string of characters from any source.