Extraction peculiarities

This section describes the peculiarities of the rules language regarding the extraction task.

As described in the introduction, extraction consists of identifying and "pulling" useful data out of input documents.
A text intelligence engine based on technology:

  • Retrieves data from (usually) unstructured documents.
  • Can extract complex sets of data thanks to sophisticated mechanisms of prediction and recognition of unknown entities.
  • Can transform and normalize data to produce a highly-refined final value.

In other words, extraction involves recognizing and extracting unknown instances of well-defined types of data, as well as aggregating and normalizing the data to make it ready for further processing and, possibly, storage.

To facilitate the extraction task and give structure to the extracted data, the process is based on a system of templates. Templates "attract" and aggregate data.
A template is composed of one or more fields, each representing a data type.

As mentioned in the introduction, when activated, an extraction rule fills the template field (or fields) declared inside the rule and generates a record.
A single rule can fill more than one field. Extracting more than one field in the same rule expresses a strong relationship between the fields and results in a multi-field record. This mechanism is called by-rule aggregation.
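The relationship between templates, fields, rules and records can be pictured with a small Python model. This is only an illustrative sketch, not the rules language itself nor the engine's API: the names `Template`, `Record`, `apply_rule`, `PERSONAL_DATA`, `NAME` and `AGE` are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Template:
    """A named set of fields (hypothetical model of a template)."""
    name: str
    fields: tuple[str, ...]


@dataclass
class Record:
    """The data produced by one activation of an extraction rule."""
    template: str
    values: dict[str, str] = field(default_factory=dict)


def apply_rule(template: Template, rule_fields: tuple[str, ...],
               matches: dict[str, str]) -> Record:
    """Fill every field declared by the rule and return a single record.

    Filling several fields in one call models by-rule aggregation:
    the values end up together in the same multi-field record.
    """
    unknown = set(rule_fields) - set(template.fields)
    if unknown:
        raise ValueError(f"fields not in template {template.name}: {sorted(unknown)}")
    return Record(template.name, {name: matches[name] for name in rule_fields})


# Hypothetical template with two fields.
PERSONAL_DATA = Template("PERSONAL_DATA", ("NAME", "AGE"))

# One rule activation that fills both fields yields one multi-field record.
record = apply_rule(PERSONAL_DATA, ("NAME", "AGE"),
                    {"NAME": "John Smith", "AGE": "42"})
print(record)
```

In this model, a rule that declared only `NAME` would produce a single-field record instead; the pairing of name and age exists only because one rule extracted both.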

On the other hand, the same field can be filled by more than one rule, since several rules are often needed to identify the field in all the forms expected by the project requirements. Field attributes and merge options combine basic records to reach a higher level of data aggregation.

A single rule can be activated multiple times by the input text, thus extracting multiple instances of the same value. In these cases, an automatic reduction mechanism called "bundling" combines the basic records so that a single value is extracted, while the individual occurrences are preserved as lower-level detail data.
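The idea behind bundling can be sketched in Python as follows. This is a simplified illustration under one stated assumption: two hits are bundled when they have the same field and exactly the same value string. The criteria actually used by the engine may differ, and the function name `bundle` and the hit layout are hypothetical.

```python
from collections import defaultdict


def bundle(hits):
    """Reduce repeated hits to one value each, keeping every occurrence.

    `hits` is a list of (field, value, start, end) tuples, where start/end
    are character offsets in the input text (an assumed layout).
    """
    grouped = defaultdict(list)
    for field_name, value, start, end in hits:
        # Assumption: identical field + identical value string => same bundle.
        grouped[(field_name, value)].append((start, end))
    return [
        {"field": field_name, "value": value, "occurrences": sorted(positions)}
        for (field_name, value), positions in grouped.items()
    ]


# The same rule fired three times; two hits carry the same value.
hits = [
    ("NAME", "John Smith", 0, 10),
    ("NAME", "Mary Jones", 120, 130),
    ("NAME", "John Smith", 57, 67),
]
bundled = bundle(hits)
for entry in bundled:
    print(entry)
```

Here the two hits for the same name collapse into a single extracted value, and their positions remain available as detail data.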

The output produced for an input document by an extraction engine consists of one or more records per document, each containing the data extracted by the extraction rules and aggregated using one of the available options or the by-rule aggregation technique. In these records, each extracted value is associated with its field, which is in turn associated with the template it belongs to. In other words, the table structure previously defined through the creation of templates and fields is returned in the output, filled with the specific pieces of data retrieved from the document.
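As a rough picture of this structure, the Python sketch below arranges records so that each value sits under its field and each field under its template. The key names (`template`, `fields`, `name`, `value`) and the sample data are assumptions made for illustration; they are not the engine's actual output format.

```python
import json


def build_output(records):
    """Arrange (template_name, {field: value}) pairs as a nested structure."""
    return [
        {
            "template": template_name,
            "fields": [{"name": name, "value": value}
                       for name, value in values.items()],
        }
        for template_name, values in records
    ]


# Two records extracted from one hypothetical document.
records = [
    ("PERSONAL_DATA", {"NAME": "John Smith", "AGE": "42"}),
    ("PERSONAL_DATA", {"NAME": "Mary Jones"}),
]
document_output = build_output(records)
print(json.dumps(document_output, indent=2))
```

The first record is multi-field, as produced by by-rule aggregation, while the second carries a single field; both keep the link from value to field to template.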

Optionally, the data retrieved from a document can be kept in its original form or refined and manipulated through different normalization or transformation approaches. See the dedicated pages for detailed information.