Composition overview
In extraction rule writing, composition is one of the two optional features that can be applied to the association between an extracted token and a field in order to control which data is transferred into the field.
In particular, composition manipulates and assembles the information found in the text to return complete, uniform and non-redundant extraction data. It allows the user to combine several elements of a positional or logical sequence and to decide in which order they will become part of the extraction output. Composition can be employed only in rules that use one of the sequence operators available in the Rules language; it returns two or more elements included in a sequence, with the option to decide the position of each element in the final output.
The syntax of a composition is as follows:
SCOPE scopeOption
{
    IDENTIFY(templateName)
    {
        @field1[attribute1]|[composition]
        sequenceOperator
        @field1[attribute2]|[composition]
    }
}
where composition refers to a pound sign (#) followed by a whole number. For example:
SCOPE scopeOption
{
    IDENTIFY(templateName)
    {
        @field1[attribute1]|[#1]
        sequenceOperator
        @field1[attribute2]|[#2]
    }
}
In each rule, the numbering must begin with the number one (1), without any gap in the sequence. Each so-called sequence "chunk" (#1, #2, etc.) must share the same destination field and be part of a sequence. Each chunk symbol must be written between brackets and positioned at the end of a full and correct extraction rule; a vertical bar (called pipe, |) separates the rule from the composition. The number associated with each chunk represents the position that the extracted value will take in the final output.
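For instance, a composition with three chunks, shown here as a purely schematic sketch (field, attributes and sequence operator are placeholders), would be numbered #1, #2, #3; a numbering such as #1, #3 would violate the no-gap requirement:
SCOPE scopeOption
{
    IDENTIFY(templateName)
    {
        @field1[attribute1]|[#1]
        sequenceOperator
        @field1[attribute2]|[#2]
        sequenceOperator
        @field1[attribute3]|[#3]
    }
}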
When using this option, it is not mandatory to use the extraction syntax on every element of the sequence; in other words, it is possible to choose which elements will be part of the extraction and which elements will only act as constraints in the rule. When composition is used, the extracted text is always the same as the text found in the original document, unless the composition is combined with a transformation.
Composition is useful when complex management of extraction data is required, particularly when the information is not concentrated in a single part of a document. A typical use of composition is when the final values are found in different parts of the text but must be selected and combined together to produce the final output. Consider the following example:
SCOPE SENTENCE
{
    IDENTIFY(PORTS)
    {
        @Port[SYNCON(39541)]|[#1] // 39541: port, city port, seaport
        >>
        KEYWORD("of")
        >>
        @Port[ANCESTOR(39663)]|[#2] // 39663: city, town
    }
}
This rule is meant to extract the concept of port (SYNCON(39541)), strictly followed by (double greater-than sign, >>) the keyword of, followed in turn by any concept descending from the syncon for city (ANCESTOR(39663)). The extraction syntax is applied only to the first and the last element of the rule, so that only the values matched by these two attributes are extracted. Note that the two attributes enclosed in the extraction syntax extract different values for the same field (@Port) and that the composition syntax is applied to each of them. The #1 and #2 declarations determine the following behavior: every element extracted for the attribute marked with #1 will be the first element of the final output, while every element extracted for the attribute marked with #2 will be the second element.
If the rule above is run against the following sample text:
Poll: Should Seattle Port CEO choose between jobs?
Port of Seattle CEO Tay Yoshitani is coming under growing criticism for trying to hold onto his $367,000 job and a seat on the board of Expeditors International. For sitting on the board, Yoshitani receives a $30,000-a-year retainer, $1,000 per-diem for board meetings or other company work and $200,000 in restricted stock each year. Ka-ching!
Yoshitani says the dual roles do not represent a conflict of interest, an opinion apparently concurred with by the Port's top attorney. But commissioners are becoming increasingly vocal about this discomfort with his roles. Thirteen state legislators wrote to Port commissioners expressing concern about Yoshitani's jobs.
The text contains one combination of values - Port of Seattle - matching the sample rule, because Port is recognized as an expression of syncon 39541 (port) and Seattle is recognized as a descendant of syncon 39663 (city). Port is then chosen as chunk #1 of the composition and Seattle as chunk #2. The final extraction is Port Seattle; the of keyword is not part of the composition.
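In a purely illustrative form (the actual output layout depends on the engine and on the host application), the resulting extraction record could be represented as:
TEMPLATE: PORTS
    FIELD: Port
    VALUE: Port Seattle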
Now consider the same rule with a slight adjustment to the composition chunks:
SCOPE SENTENCE
{
    IDENTIFY(PORTS)
    {
        @Port[SYNCON(39541)]|[#2] // 39541: port, city port, seaport
        >>
        KEYWORD("of")
        >>
        @Port[ANCESTOR(39663)]|[#1] // 39663: city, town
    }
}
The only difference is that the order of the two chunks has been inverted, so the extracted value for field @Port becomes Seattle Port.
The sample document contains another reference to the same port (Seattle Port). If a second rule is added to the set in order to extract this instance, an interesting behavior of the engine can be observed:
SCOPE SENTENCE
{
    IDENTIFY(PORTS)
    {
        @Port[ANCESTOR(39541) + TYPE(NPR)] // 39541: port, city port, seaport
    }
}
This rule is meant to extract any proper noun (TYPE(NPR)) - whether known or unknown to the Knowledge Graph - that is recognized as the name of a port (ANCESTOR(39541)). Since the second rule would extract Seattle Port and this value coincides with the outcome of the first rule, the engine creates only one output record, thus ensuring non-redundant and normalized results.
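Again in a purely illustrative form, running the two rules together against the sample text would therefore produce a single record rather than two:
TEMPLATE: PORTS
    FIELD: Port
    VALUE: Seattle Port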
In other words, the two rules manage two different forms of the same concept and extract different elements in different positions. However, the composition syntax allows the user to deconstruct and recompose the elements so that data normalization can be applied.