By-rule aggregation
Syntax
By-rule aggregation is the basic way to obtain records containing two or more fields. It consists of using more than one field-prefixed operand in a rule's condition, as in the following general form:
//optional comment describing the rule
SCOPE scopeOption
{
IDENTIFY(templateName)
{
@fieldName1[operand]
operator
@fieldName2[operand]
...
}
}
Rules with this syntax extract fields only if their values occur in the context of the condition, and aggregate them in a single output record.
Single field records
The syntax of the simplest (non-aggregating) extraction rule is:
SCOPE scopeOption
{
IDENTIFY(templateName)
{
@fieldName[operand]
}
}
For example, given this template:
TEMPLATE(PERSONAL_DATA)
{
@Name
}
this rule:
SCOPE SENTENCE
{
IDENTIFY(PERSONAL_DATA)
{
@Name[TYPE(NPH)]
}
}
would extract people's names (TYPE(NPH)) into the @Name field belonging to the PERSONAL_DATA template.
If the rule is run against this text:
Doug Smith lives at 1540 Chicago Avenue, Baltimore.
the output is a record containing only the @Name field.
Template: PERSONAL_DATA
@Name |
---|
Doug Smith |
If the same rule is run against this text:
Doug Smith lives at 1540 Chicago Avenue, Baltimore. He lives there with his wife Norah and their two children.
the output is a pair of records, each containing only the @Name field.
Template: PERSONAL_DATA
@Name |
---|
Doug Smith |
Template: PERSONAL_DATA
@Name |
---|
Norah |
Two people's names are identified and extracted into two different records.
Note
In fact, if a rule is activated by several tokens, the engine will generate a separate record for each of them.
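The behavior described in the note can be pictured with a small Python model. This is only a sketch of the idea, not the engine's implementation: the `extract_single_field` and `split_sentences` helpers and the hard-coded `KNOWN_NAMES` list (standing in for TYPE(NPH)) are hypothetical names introduced for illustration.

```python
import re

# Hypothetical stand-in for TYPE(NPH): a fixed list of names the toy model recognizes.
KNOWN_NAMES = ["Doug Smith", "Norah"]

def split_sentences(text):
    """Naive sentence splitter used to mimic SCOPE SENTENCE."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def extract_single_field(text, template="PERSONAL_DATA", field="@Name"):
    """Return one single-field record per matching name, sentence by sentence."""
    records = []
    for sentence in split_sentences(text):
        for name in KNOWN_NAMES:
            if name in sentence:
                # Each activating match yields its own record.
                records.append({"template": template, field: name})
    return records

text = ("Doug Smith lives at 1540 Chicago Avenue, Baltimore. "
        "He lives there with his wife Norah and their two children.")
records = extract_single_field(text)
```

With the two-sentence sample text, `records` holds two separate single-field records, one for each name.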
Multiple field records
Usually, extraction projects require templates with more than one field, for example:
TEMPLATE(PERSONAL_DATA)
{
@Name,
@Telephone,
@Address
}
In this scenario, it's still possible to use single-field rules like these:
SCOPE SENTENCE
{
IDENTIFY(PERSONAL_DATA)
{
@Name[TYPE(NPH)]
}
IDENTIFY(PERSONAL_DATA)
{
@Telephone[ANCESTOR(29700)] // 29700, phone number
}
IDENTIFY(PERSONAL_DATA)
{
@Address[TYPE(ADR)]
}
}
If the above rules are used to analyze the following sample text:
Doug Smith lives at 1540 Chicago Avenue, Baltimore. Doug's number is 555-234-567.
the output will contain three different records:
Template: PERSONAL_DATA
@Name |
---|
Doug Smith |
Template: PERSONAL_DATA
@Telephone |
---|
555-234-567 |
Template: PERSONAL_DATA
@Address |
---|
1540, Chicago Avenue - Baltimore |
This approach has a serious weakness: it carries a high risk of extracting unrelated data.
In the example case, nothing guarantees that the phone number and the address refer to a particular person; it is not even guaranteed that they refer to a person.
This can be acceptable when it is certain that input documents only contain related data. In the case of personal data, it would mean that each document always contains one (and only one) person's name and other data related to that person. Such documents would effectively be personal records, which are uncommon.
Much more often, documents contain both related and unrelated data, and the aim of the project is to extract only the related data.
A multi-field template implicitly declares a desired correlation: the output records should contain related data.
One way to reach this goal is to write conditions that model the relation between fields. Possible relations between fields are declared in the rule, hence the concept of by-rule aggregation.
This relation can be simple co-occurrence in the same scope, positional (e.g., "the value for field2 usually comes after the value for field1") or syntactic (e.g., "the value for field1 is usually the subject and the value for field2 is the object").
For example, this rule:
SCOPE SENTENCE
{
IDENTIFY(PERSONAL_DATA)
{
@Name[TYPE(NPH)]
AND
@Telephone[ANCESTOR(29700)] // 29700, phone number
}
}
extracts both @Name and @Telephone fields based on the assumption that, to be considered related, it is sufficient that they co-occur in the same sentence. When triggered, this rule produces a record that contains a pair of supposedly related fields:
Template: PERSONAL_DATA
@Name | @Telephone |
---|---|
Doug Smith | 555-234-567 |
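The co-occurrence assumption can be illustrated with a toy Python model. It is a sketch under stated assumptions, not the engine's logic: `aggregate_by_cooccurrence`, the `KNOWN_NAMES` list (standing in for TYPE(NPH)) and the `PHONE` pattern (standing in for ANCESTOR(29700)) are hypothetical, and the sample text is invented for the illustration.

```python
import re

KNOWN_NAMES = ["Doug Smith"]                   # hypothetical stand-in for TYPE(NPH)
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{3}\b")   # hypothetical stand-in for ANCESTOR(29700)

def aggregate_by_cooccurrence(text):
    """Emit a two-field record only for sentences containing both a name and a phone number."""
    records = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        names = [n for n in KNOWN_NAMES if n in sentence]
        phones = PHONE.findall(sentence)
        if names and phones:
            # Simplification: pair only the first name with the first number found.
            records.append({"@Name": names[0], "@Telephone": phones[0]})
    return records

text = "Doug Smith's number is 555-234-567. The front desk answers at 555-000-111."
records = aggregate_by_cooccurrence(text)
```

The second sentence contains a phone number but no name, so it yields no record: the AND condition is not satisfied within that scope.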
When applicable, positional sequences combined with additional non-extraction constraints make conditions stricter and increase the probability of capturing real relations between the extracted data.
Consider the following sample rules:
SCOPE SENTENCE
{
IDENTIFY(PERSONAL_DATA)
{
@Name[TYPE(NPH)]
>
LEMMA("live")
>
@Address[TYPE(ADR)]
}
IDENTIFY(PERSONAL_DATA)
{
@Name[TYPE(NPH)]
>>
KEYWORD("'s")
<1:3>
@Telephone[ANCESTOR(29700)] // 29700, phone number
}
}
Here the mere presence of a person's name and other personal data in the same sentence is not enough to consider the data as related. The relative position of the data and the relation between them are defined using positional sequences and other elements, unrelated to extraction, that act as additional constraints. For example, in the first rule, a person's name is extracted along with an address only if the sentence explicitly states that the person lives at that address.
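The difference between mere co-occurrence and an ordered, constrained condition can be sketched in Python. The model below only checks that a name, a form of "live" and an address appear in that order in one sentence. It does not reproduce the exact semantics of the sequence operators used in the rules, and `extract_name_address`, `KNOWN_NAMES`, `ADDRESS` and `LIVE` are hypothetical names introduced for the illustration.

```python
import re

KNOWN_NAMES = ["Doug Smith"]                      # hypothetical stand-in for TYPE(NPH)
ADDRESS = re.compile(r"\d+ [A-Z][a-z]+ Avenue")   # hypothetical stand-in for TYPE(ADR)
LIVE = re.compile(r"\blive[sd]?\b|\bliving\b")    # crude stand-in for LEMMA("live")

def extract_name_address(sentence):
    """Return a record only if name, a form of 'live' and address appear in that order."""
    for name in KNOWN_NAMES:
        name_pos = sentence.find(name)
        if name_pos < 0:
            continue
        # The verb must come after the name...
        verb = LIVE.search(sentence, name_pos + len(name))
        if verb is None:
            continue
        # ...and the address must come after the verb.
        address = ADDRESS.search(sentence, verb.end())
        if address is not None:
            return {"@Name": name, "@Address": address.group(0)}
    return None

related = extract_name_address("Doug Smith lives at 1540 Chicago Avenue, Baltimore.")
unrelated = extract_name_address("The shop at 1540 Chicago Avenue was visited by Doug Smith.")
```

Both sentences contain a name and an address, but only the first satisfies the ordered condition, so `unrelated` is `None`.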
The rules above potentially "capture" less, but you can use the OPTIONAL and MANDATORY operators to model all the foreseeable combinations with fewer rules.
Alternatively, as in the example above, sentences more commonly contain just a couple of pieces of data (for example, a person's name and address) than values for all the template fields. Rules therefore typically aggregate two fields at a time, and the cardinal field definition and the merge option can then be used to create compound records containing the maximum number of supposedly related fields found in the same scope.
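The idea of building compound records from two-field records can be pictured with a short Python sketch. It assumes, purely for illustration, that partial records are merged when they agree on the value of a cardinal field (here @Name). The engine's actual merge behavior and its handling of scope are not modeled, and `merge_records` is a hypothetical helper.

```python
def merge_records(records, cardinal="@Name"):
    """Merge partial records that share the same value for the cardinal field."""
    merged = {}
    for record in records:
        key = record[cardinal]
        target = merged.setdefault(key, {})
        for field, value in record.items():
            # Keep the first value seen for a field; later records only add new fields.
            target.setdefault(field, value)
    return list(merged.values())

partial = [
    {"@Name": "Doug Smith", "@Address": "1540 Chicago Avenue"},
    {"@Name": "Doug Smith", "@Telephone": "555-234-567"},
]
compound = merge_records(partial)
```

Here the two partial records collapse into a single record carrying @Name, @Address and @Telephone.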