Merge option

The merge mechanism

The merge option determines the merging of simple extraction records into compound records. Simple records disappear in the merge, so only compound records are returned as engine output.

The syntax is:

TEMPLATE(templateName)
{
    @fieldName_1,
    @fieldName_2,
    ...
    @fieldName_n

    MERGE WHEN scope
}

MERGE and WHEN are language keywords and must be written in uppercase.
A template can have at most one merge option.

scope defines the range of the merge, that is, the portion of text to be considered as the source of the simple records to be merged. If, for example, the scope is SENTENCE, all simple records originating from the same sentence are merged into one compound record, so a separate compound record is generated for every sentence from which something was extracted. If, instead, the scope is DOCUMENT, all simple records extracted from the whole document are merged into a single, potentially large, compound record.

Note

Merging represents a notable exception to the template-table similarity, because it creates compound records that can contain more than one instance of the same field.

Possible scopes are:

SENTENCE
CLAUSE
PARAGRAPH
SEGMENT
SECTION
DOCUMENT

The scopes are illustrated in the next section of this topic.

Merging is useful when extracted data can be considered related because it is located in the same portion of text.

The following are examples of the merge option used with different scopes.
Consider this template and corresponding extraction rules:

TEMPLATE(PERSONAL_DATA)
{
    @Name,
    @Telephone,
    @Address
}

SCOPE SENTENCE
{
    IDENTIFY(PERSONAL_DATA)
    {
        @Name[TYPE(NPH)]
    }

    IDENTIFY(PERSONAL_DATA)
    {
        @Telephone[ANCESTOR(29700)] // 29700: phone number
    }

    IDENTIFY(PERSONAL_DATA)
    {
        @Address[TYPE(ADR)]
    }
}

If the rules are run against this text:

Doug Smith lives at 1540 Chicago Avenue, Baltimore. Doug's number is 555-234-567

the text intelligence engine will generate three records for the PERSONAL_DATA template: one with the @Name field (with two instances of the value Doug Smith), another with the @Telephone field, and the last with the @Address field.

Template: PERSONAL_DATA

@Name
Doug Smith

Template: PERSONAL_DATA

@Telephone
555-234-567

Template: PERSONAL_DATA

@Address
1540, Chicago Avenue - Baltimore

There is no aggregation: each rule fills a single field, so the generated records only contain one field.

If the merge option is added to the template definition:

TEMPLATE(PERSONAL_DATA)
{
    @Name,
    @Telephone,
    @Address

    MERGE WHEN SENTENCE
}

it will cause the generation of two compound records, one per sentence:

The first contains a person's name and an address.
The second contains a person's name and a telephone number.

Template: PERSONAL_DATA

@Name	@Address
Doug Smith	1540, Chicago Avenue - Baltimore

Template: PERSONAL_DATA

@Name	@Telephone
Doug Smith	555-234-567

If the merge option is modified like this:

MERGE WHEN DOCUMENT

a single compound record will be generated for the entire document and it will have all the template fields set.

Template: PERSONAL_DATA

@Name	@Telephone	@Address
Doug Smith	555-234-567	1540, Chicago Avenue - Baltimore
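
The grouping behavior shown in these examples can be modeled roughly in a few lines of Python. This is only an illustrative sketch, not the engine's implementation: the tuple layout (field, value, sentence index), the function name merge_records and the choice to collapse identical values of the same field are all assumptions made to reproduce the outputs displayed above.

```python
from collections import defaultdict

# Hypothetical simple records for the sample text: (field, value, sentence index).
SIMPLE_RECORDS = [
    ("Name", "Doug Smith", 0),
    ("Address", "1540, Chicago Avenue - Baltimore", 0),
    ("Name", "Doug Smith", 1),
    ("Telephone", "555-234-567", 1),
]


def merge_records(records, scope):
    """Group simple records into compound records according to the scope."""
    if scope not in ("SENTENCE", "DOCUMENT"):
        raise ValueError(f"scope not covered by this sketch: {scope}")
    groups = defaultdict(lambda: defaultdict(list))
    for field, value, sentence in records:
        # DOCUMENT puts every record in one group; SENTENCE keys on the sentence.
        key = 0 if scope == "DOCUMENT" else sentence
        values = groups[key][field]
        if value not in values:  # assumption: identical values collapse
            values.append(value)
    # One compound record (field -> list of values) per group, in text order.
    return [dict(groups[key]) for key in sorted(groups)]


by_sentence = merge_records(SIMPLE_RECORDS, "SENTENCE")
by_document = merge_records(SIMPLE_RECORDS, "DOCUMENT")
```

In this model, by_sentence holds two compound records (name with address, then name with telephone) and by_document holds one record with all three fields. Each field maps to a list because a compound record can contain more than one instance of the same field.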

The merge option also plays a role when a cardinal field is defined in a template: combined with the cardinal attribute, it aggregates two separate records into one if both have the same value for the cardinal field.

The syntax is:

TEMPLATE(templateName)
{
    @fieldName_1(C),
    @fieldName_2,
    ...
    @fieldName_n

    MERGE WHEN scope
}

For a more complete description of the cardinal attribute, please see the dedicated topic.
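
As a rough illustration of the aggregation described above, the following Python sketch combines records that share the same value for a cardinal field. It is a simplified model under stated assumptions, not the engine's algorithm: the record layout (field name mapped to a list of values), the function name merge_on_cardinal and the handling of records without a cardinal value are all hypothetical.

```python
def merge_on_cardinal(records, cardinal_field):
    """Combine records that share the same value for the cardinal field."""
    merged = {}     # cardinal value -> combined record
    untouched = []  # records that carry no value for the cardinal field
    for record in records:
        values = record.get(cardinal_field)
        if not values:
            untouched.append(record)
            continue
        # Assumption: the first value of the cardinal field identifies the record.
        target = merged.setdefault(values[0], {})
        for field, field_values in record.items():
            bucket = target.setdefault(field, [])
            for value in field_values:
                if value not in bucket:
                    bucket.append(value)
    return list(merged.values()) + untouched


# Two hypothetical records that share the same @Name value.
records = [
    {"Name": ["Doug Smith"], "Address": ["1540, Chicago Avenue - Baltimore"]},
    {"Name": ["Doug Smith"], "Telephone": ["555-234-567"]},
]
combined = merge_on_cardinal(records, "Name")
```

With Name treated as the cardinal field, the two sample records collapse into a single record that carries the name, the address and the telephone number.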

Scope peculiarities

Overview

Just like the scopes of categorization and extraction rules, merge scopes correspond to subdivisions of the input text that are either generated by the disambiguator (DOCUMENT, SENTENCE and PARAGRAPH) or defined for a specific project (SECTION and SEGMENT).
Except for DOCUMENT, which is specific to the merge option, the scopes are the same as those of categorization and extraction rules, although the merge option syntax is simpler.

DOCUMENT

As the name says, the DOCUMENT scope is the whole input document. Its effect is that all simple records are merged into one potentially big record, no matter the position of the text that was extracted to fill their fields.

SENTENCE, CLAUSE and PARAGRAPH

With SENTENCE, CLAUSE and PARAGRAPH scopes, a compound record is generated out of simple records in which the fields have been set with the values extracted from the same sentence, clause or paragraph.

Unlike in categorization and extraction rules, it is not possible to use the multiplier (*) to extend the scope to two or more consecutive sentences or paragraphs.

As in categorization and extraction rules, SENTENCE, CLAUSE and PARAGRAPH scopes can be combined with SECTION and SEGMENT scopes.

The syntax is:

WHEN [SENTENCE | PARAGRAPH | CLAUSE [(type)]] IN SECTION | SEGMENT(name)

where:

name is the name of the section or segment.
type is the optional clause type.

For example, the following are valid scopes:

WHEN SENTENCE
WHEN PARAGRAPH
WHEN PARAGRAPH IN SEGMENT(COVER_PAGE)
WHEN SENTENCE IN SECTION(TITLE)

SECTION and SEGMENT

With SECTION and SEGMENT, a compound record is generated out of simple records in which the fields have been set with the values extracted from the same section or segment.

As in categorization and extraction rules, advanced combinations of SECTION and SEGMENT scopes can be defined. These include:

The intersection of a section with one or more segments.
The intersection of two or more segments.

The syntax is:

WHEN SECTION(sectionName:segmentName)
WHEN SEGMENT(segmentName:segmentName)

For example, the following are valid scopes:

WHEN SECTION (BODY)
WHEN SEGMENT (SENDER)
WHEN SECTION (BODY:BYLINE)
WHEN SEGMENT (SENDER:ADDRESSES)
WHEN SECTION (SENDER, RECEIVER:ADDRESSES)

Warning

MERGE WHEN SEGMENT merges records from the same segment, regardless of the scope of the underlying extraction rules. Suppose that two segments, for example S1 and S2, overlap, and that records are generated from the overlapping zone by rules whose scope is only one of the two segments, such as S1. Because the overlapping zone also belongs to S2, those records are merged twice in exactly the same way, once for each segment, and the resulting compound records are identical.
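
The effect described in this warning can be reproduced with a small Python model. It is a sketch under assumptions rather than the engine's logic: segments are represented as hypothetical character ranges, records as (field, value, position) tuples, and merge_by_segment is an invented name.

```python
# Hypothetical overlapping segments, expressed as [start, end) character ranges.
SEGMENTS = {"S1": (0, 100), "S2": (50, 150)}

# Hypothetical simple records extracted from the overlapping zone (50..100).
RECORDS = [
    ("Name", "Doug Smith", 60),
    ("Telephone", "555-234-567", 80),
]


def merge_by_segment(records, segments):
    """Build one compound record per segment that contains any record."""
    compound = {}
    for name, (start, end) in segments.items():
        merged = {}
        for field, value, position in records:
            # A record belongs to every segment whose range covers its position,
            # whatever the scope of the rule that extracted it.
            if start <= position < end:
                merged.setdefault(field, []).append(value)
        if merged:
            compound[name] = merged
    return compound


result = merge_by_segment(RECORDS, SEGMENTS)
```

Because both sample records fall inside the zone shared by S1 and S2, the model produces one compound record per segment, and the two records have identical content.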