SEGMENT

SEGMENT is an extraction transformation option that completes the matched value rather than normalizing it: it adds the elements surrounding the original matched data to the final extracted value.

It is based on the concept of a segment, a custom text subdivision that can optionally be defined for a project.

The SEGMENT option returns the whole segment containing the value matched by an attribute.

Warning

Because the whole segment is returned, this transformation option can extract a very large amount of text.

This option should only be used if one or more segments have been previously defined in the project. Also, at least one segment must be specified in the rule scope.

The syntax of the SEGMENT option is the following:

SCOPE scopeOption IN SEGMENT(segmentName)
{
    IDENTIFY(templateName)
    {
        @field[attribute]|[SEGMENT]
    }
}

This option is useful when the extraction output needs to be expanded from a matched element to the text surrounding it.

Consider the following example:

SCOPE SEGMENT(TITLE)
{
    IDENTIFY(ARTICLE)
    {
        @Title[TYPE(NOU)]|[SEGMENT]
    }
}

The purpose of this rule is to extract nouns (TYPE(NOU)) within a previously defined segment called TITLE (SCOPE SEGMENT(TITLE)). When a noun is found in that segment, the SEGMENT transformation option expands the extracted value to the whole segment in which the noun occurs.

Consider the extraction output if the rule above is run against the following sample text:

Flu Widespread, Leading a Range of Winter's Ills
By DONALD G. McNEIL Jr. and KATHARINE Q. SEELYE
Published: January 9, 2013
It is not your imagination - more people you know are sick this winter, even people who have had flu shots.
The country is in the grip of three emerging flu or flulike epidemics: an early start to the annual flu season with an unusually aggressive virus, a surge in a new type of norovirus, and the worst whooping cough outbreak in 60 years. And these are all developing amid the normal winter highs for the many viruses that cause symptoms on the "colds and flu" spectrum.
Influenza is widespread, and causing local crises. On Wednesday, Boston's mayor declared a public health emergency as cases flooded hospital emergency rooms.

Let's suppose the segment TITLE was detected based on a positional criterion (e.g., the first line of the text):

Flu Widespread, Leading a Range of Winter's Ills

The presence of this segment is the first condition that must be met for the rule to be triggered.
Here, the segment TITLE plays two roles:

  • As the scope of the rule, it restricts the extraction of nouns to just the portion of text it delimits, in this case, the title.
  • As the transformation option, the segment itself is the final output of the extraction process.

The final result is a single instance of the segment text. Although the rule is triggered by each of the four nouns contained in the segment (Flu, Range, Winter, Ills), the engine recognizes that all of them belong to the same segment and returns a single record instead of four identical ones.
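
The expand-then-collapse behavior can be illustrated with a short Python sketch. This is not the engine's implementation: the function name expand_to_segments, the representation of segments as character spans, and the hard-coded noun list (a stand-in for the engine's TYPE(NOU) matching) are all assumptions made for illustration.

```python
# Hypothetical model: a segment is a name mapped to a (start, end) character span.
def expand_to_segments(text, segments, matches):
    """Replace each matched (start, end) span with the segment containing it.

    Each segment's text is returned only once, in order of first match,
    which mimics the single record described above.
    """
    seen = []
    for m_start, m_end in matches:
        for name, (s_start, s_end) in segments.items():
            inside = s_start <= m_start and m_end <= s_end
            if inside and name not in seen:
                seen.append(name)
    return [text[segments[name][0]:segments[name][1]] for name in seen]


text = (
    "Flu Widespread, Leading a Range of Winter's Ills\n"
    "By DONALD G. McNEIL Jr. and KATHARINE Q. SEELYE"
)

# Assumed positional criterion: TITLE is the first line of the text.
segments = {"TITLE": (0, text.index("\n"))}

# Stand-in for the nouns the rule would match inside the TITLE scope.
nouns = ["Flu", "Range", "Winter", "Ills"]
matches = [(text.index(word), text.index(word) + len(word)) for word in nouns]

result = expand_to_segments(text, segments, matches)
print(result)
```

Four matches go in, but because all of them fall inside the same TITLE span, the sketch returns the title text once.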