Merge option
The merge mechanism
The merge option determines the merging of simple extraction records into compound records. Simple records disappear in the merge, so only compound records are returned as engine output.
The syntax is:
TEMPLATE(templateName)
{
@fieldName_1,
@fieldName_2,
...
@fieldName_n
MERGE WHEN scope
}
MERGE and WHEN are language keywords and must be written in uppercase.
A template can have at most one merge option.
scope defines the range of the merge, that is, the portion of text considered as the source of the simple records to be merged. If, for example, the scope is SENTENCE, all simple records originating from the same sentence are merged into a compound record, so a separate compound record is generated for every sentence from which something was extracted. If instead the scope is DOCUMENT, all simple records extracted from the whole document are merged into a single, potentially large, compound record.
Note
Merging represents a notable exception to the template-table similarity, because it creates compound records that can contain more than one instance of the same field.
Possible scopes are:
SENTENCE
CLAUSE
PARAGRAPH
SEGMENT
SECTION
DOCUMENT
The scopes are illustrated in the next section of this topic.
Merging is useful when extracted data can be considered related, because it's located in the same portion of text.
The following are examples of the merge option used with different scopes.
Consider this template and corresponding extraction rules:
TEMPLATE(PERSONAL_DATA)
{
@Name,
@Telephone,
@Address
}
SCOPE SENTENCE
{
IDENTIFY(PERSONAL_DATA)
{
@Name[TYPE(NPH)]
}
IDENTIFY(PERSONAL_DATA)
{
@Telephone[ANCESTOR(29700)] // 29700: phone number
}
IDENTIFY(PERSONAL_DATA)
{
@Address[TYPE(ADR)]
}
}
If the rules are run against this text:
Doug Smith lives at 1540 Chicago Avenue, Baltimore. Doug's number is 555-234-567
the text intelligence engine will generate three records for the PERSONAL_DATA template, one with the @Name field (with two instances of value Doug Smith), another with the @Telephone field and the last with the @Address field.
Template: PERSONAL_DATA
@Name |
---|
Doug Smith |
Template: PERSONAL_DATA
@Telephone |
---|
555-234-567 |
Template: PERSONAL_DATA
@Address |
---|
1540, Chicago Avenue - Baltimore |
There is no aggregation: each rule fills a single field, so the generated records only contain one field.
If the merge option is added to the template definition:
TEMPLATE(PERSONAL_DATA)
{
@Name,
@Telephone,
@Address
MERGE WHEN SENTENCE
}
it will cause the generation of two compound records, one per sentence:
- The first contains a person's name and an address.
- The second contains a person's name and a telephone number.
Template: PERSONAL_DATA
@Name | @Address |
---|---|
Doug Smith | 1540, Chicago Avenue - Baltimore |
Template: PERSONAL_DATA
@Name | @Telephone |
---|---|
Doug Smith | 555-234-567 |
If the merge option is modified like this:
MERGE WHEN DOCUMENT
a single compound record will be generated for the entire document and it will have all the template fields set.
Template: PERSONAL_DATA
@Name | @Telephone | @Address |
---|---|---|
Doug Smith | 555-234-567 | 1540, Chicago Avenue - Baltimore |
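The difference between the two scopes can also be modeled outside the engine. The following Python sketch is only an illustration: the tuple layout of the simple records, the `merge_records` function and the sentence indexes are assumptions made for this example, not the engine's internal representation. The sketch keeps every instance of a field, so the document-level record holds the name twice; whether the engine collapses identical values is not covered here.

```python
from collections import defaultdict

# Hypothetical simple records: (field, value, sentence_index).
# This layout is an assumption for the sketch, not the engine's format.
SIMPLE_RECORDS = [
    ("Name", "Doug Smith", 0),
    ("Address", "1540, Chicago Avenue - Baltimore", 0),
    ("Name", "Doug Smith", 1),
    ("Telephone", "555-234-567", 1),
]


def merge_records(simple_records, scope):
    """Group simple records into compound records.

    Only SENTENCE and DOCUMENT are modeled; other scopes are left out.
    """
    if scope not in ("SENTENCE", "DOCUMENT"):
        raise ValueError(f"scope not modeled in this sketch: {scope}")
    groups = defaultdict(lambda: defaultdict(list))
    for field, value, sentence in simple_records:
        # DOCUMENT uses a single key; SENTENCE uses one key per sentence.
        key = 0 if scope == "DOCUMENT" else sentence
        groups[key][field].append(value)
    # One compound record per key, in text order.
    return [dict(groups[key]) for key in sorted(groups)]


by_sentence = merge_records(SIMPLE_RECORDS, "SENTENCE")
by_document = merge_records(SIMPLE_RECORDS, "DOCUMENT")
print(len(by_sentence), "compound records with SENTENCE")
print(len(by_document), "compound record with DOCUMENT")
```

With the sample data, the SENTENCE scope yields one record per sentence and the DOCUMENT scope yields one record containing all three fields.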
The merge option also has a role when a cardinal field is defined in a template. In fact, the combination of the merge option with the cardinal attribute will aggregate two separate records into one, if both have the same value for the cardinal field.
The syntax is:
TEMPLATE(templateName)
{
@fieldName_1(C),
@fieldName_2,
...
@fieldName_n
MERGE WHEN scope
}
For a more complete description of the cardinal attribute, please see the dedicated topic.
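As a rough illustration of the aggregation described above, the following Python sketch combines records that share the same value for the cardinal field. It is one possible reading of the behavior, not the engine's implementation: the dictionary layout, the `merge_on_cardinal` function, the "Ann Lee" sample record and the choice to leave records without a cardinal value untouched are all assumptions.

```python
def merge_on_cardinal(records, cardinal_field):
    """Combine records that share the same value for the cardinal field.

    Each record is a dict mapping a field name to a list of values
    (a hypothetical layout chosen for this sketch).
    """
    merged = {}
    untouched = []
    for record in records:
        values = record.get(cardinal_field)
        if not values:
            # Assumption: records without a cardinal value are not aggregated.
            untouched.append(record)
            continue
        key = values[0]
        target = merged.setdefault(key, {cardinal_field: [key]})
        for field, field_values in record.items():
            if field != cardinal_field:
                target.setdefault(field, []).extend(field_values)
    return list(merged.values()) + untouched


records = [
    {"Name": ["Doug Smith"], "Address": ["1540, Chicago Avenue - Baltimore"]},
    {"Name": ["Doug Smith"], "Telephone": ["555-234-567"]},
    {"Name": ["Ann Lee"], "Telephone": ["555-000-111"]},  # hypothetical data
]
aggregated = merge_on_cardinal(records, "Name")
print(len(aggregated), "records after aggregation")
```

Here the two records with the name Doug Smith become one record holding both the address and the telephone number, while the record with a different name stays separate.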
Scope peculiarities
Overview
As with the scopes of categorization and extraction rules, merge scopes correspond to subdivisions of the input text that are either generated by the disambiguator (DOCUMENT, SENTENCE and PARAGRAPH) or defined for a specific project (SECTION and SEGMENT).
With the exception of DOCUMENT, which is specific to the merge option, the scopes are the same as those of categorization and extraction rules, although the merge option syntax is simpler.
DOCUMENT
As the name says, the DOCUMENT scope is the whole input document. Its effect is that all simple records are merged into one potentially big record, no matter the position of the text that was extracted to fill their fields.
SENTENCE, CLAUSE and PARAGRAPH
With SENTENCE, CLAUSE and PARAGRAPH scopes, a compound record is generated out of simple records in which the fields have been set with the values extracted from the same sentence, clause or paragraph.
Unlike in categorization and extraction rules, it is not possible to use the multiplier (*) to extend the scope to two or more consecutive sentences or paragraphs.
As in categorization and extraction rules, SENTENCE, CLAUSE and PARAGRAPH scopes can be combined with SECTION and SEGMENT scopes.
The syntax is:
WHEN [SENTENCE | PARAGRAPH | CLAUSE [(type)]] IN SECTION(name) | SEGMENT(name)
where:
- name is the name of the section or segment.
- type is the optional clause type.
For example, the following are valid scopes:
WHEN SENTENCE
WHEN PARAGRAPH
WHEN PARAGRAPH IN SEGMENT(COVER_PAGE)
WHEN SENTENCE IN SECTION(TITLE)
SECTION and SEGMENT
With SECTION and SEGMENT, a compound record is generated out of simple records in which the fields have been set with the values extracted from the same section or segment.
As in categorization and extraction rules, advanced combinations of SECTION and SEGMENT scopes can be defined. These include:
- The intersection of a section with one or more segments.
- The intersection of two or more segments.
The syntax is:
WHEN SECTION(sectionName:segmentName)
WHEN SEGMENT(segmentName:segmentName)
For example, the following are valid scopes:
WHEN SECTION (BODY)
WHEN SEGMENT (SENDER)
WHEN SECTION (BODY:BYLINE)
WHEN SEGMENT (SENDER:ADDRESSES)
WHEN SECTION (SENDER, RECEIVER:ADDRESSES)
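To make the idea of an intersection concrete, the Python sketch below treats a section and a segment as character ranges and keeps only the simple records that fall inside the portion shared by both. Everything in it is an assumption made for illustration: the offsets, the single range per area, and the `intersection` and `records_in` helpers do not come from the engine.

```python
# Hypothetical character ranges (start, end), with end exclusive.
AREAS = {
    "BODY": (100, 900),    # section
    "BYLINE": (100, 160),  # segment
}

# Hypothetical simple records: (field, value, character_offset).
RECORDS = [
    ("Name", "Doug Smith", 110),
    ("Telephone", "555-234-567", 400),
]


def intersection(names, areas):
    """Return the range shared by all named areas, or None if there is none."""
    start = max(areas[name][0] for name in names)
    end = min(areas[name][1] for name in names)
    return (start, end) if start < end else None


def records_in(span, records):
    """Keep the records whose offset falls inside the given range."""
    if span is None:
        return []
    start, end = span
    return [record for record in records if start <= record[2] < end]


# Models a scope such as SECTION (BODY:BYLINE).
span = intersection(["BODY", "BYLINE"], AREAS)
in_scope = records_in(span, RECORDS)
print(span, in_scope)
```

Only the record extracted inside both BODY and BYLINE remains a candidate for the merge in this model.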
Warning
MERGE WHEN SEGMENT merges records from the same segment, regardless of the scope of the underlying extraction rules. Therefore, if two segments, for example S1 and S2, overlap, and records are generated from the overlapping zone by rules whose scope is only one of the two segments, such as S1, the records are merged in exactly the same way twice, once for each segment, because they also fall within the second segment. The resulting compound records are identical.
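The duplication described in this warning can be reproduced with a small Python model. It is a sketch under stated assumptions: segments are represented as character ranges, the offsets and the `merge_per_segment` function are hypothetical, and both sample records are placed in the zone where S1 and S2 overlap.

```python
# Hypothetical overlapping segments as (start, end) ranges, end exclusive.
SEGMENTS = {
    "S1": (0, 100),
    "S2": (50, 150),
}

# Hypothetical simple records located in the overlapping zone (50 to 99).
RECORDS = [
    ("Name", "Doug Smith", 60),
    ("Telephone", "555-234-567", 70),
]


def merge_per_segment(segments, records):
    """Build one compound record per segment that contains any record."""
    compound_records = []
    for start, end in segments.values():
        fields = {}
        for field, value, offset in records:
            # The rule's own scope is ignored: only the position matters.
            if start <= offset < end:
                fields.setdefault(field, []).append(value)
        if fields:
            compound_records.append(fields)
    return compound_records


compound = merge_per_segment(SEGMENTS, RECORDS)
print(len(compound), "compound records, identical:", compound[0] == compound[1])
```

Because both records lie inside S1 and S2 at the same time, the model produces two compound records with the same content.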