Segmentation rules

The fundamental aim of segmentation rules is to define dynamic segment boundaries. There are two main ways to identify a segment’s boundaries:

  • By specifying a linguistic condition and a scope. The segment instance occupies the same position as the portion of text defined by the scope.
  • By specifying both the linguistic condition that makes the segment begin and the linguistic condition that makes it end.

The syntax of a simple segmentation rule is the following:

SCOPE scopeOption
{
    SEGMENT(segmentName)
    {
        condition
    }
}

For example, consider the following sample text:

About a month ago I was diagnosed with "pre-diabetes" after a blood test.
I had complained to my doctor of constant tiredness and lack of energy throughout the day.
I will be the first to admit my diet is lousy - pizzas, burgers, chocolate, takeaways, fizzy drinks are all vices of mine.
To help me monitor this I purchased this accu-check gadget and although the concept of taking my own blood samples was a bit daunting, it really is very easy indeed.
Unfortunately, in the last days my blood glucose monitor seems to give incorrect readings, I tried several times turning it off and on again, but it still doesn't work.

and suppose that the use case requires identifying the portion of text that deals with malfunctions.

The following rule:

SCOPE PARAGRAPH
{
    SEGMENT(MALFUNCTION)
    {
        KEYWORD("n't","not")
        >
        LEMMA ("work")
    }

}

identifies the segment MALFUNCTION using a single point of reference: the presence in the text of the lemma work preceded by a negation. The boundaries of the segment coincide with the portion of text defined by the scope, in this case the scope option PARAGRAPH. The highlighted text corresponds to the instance of the segment:


About a month ago I was diagnosed with "pre-diabetes" after a blood test.
I had complained to my doctor of constant tiredness and lack of energy throughout the day.
I will be the first to admit my diet is lousy - pizzas, burgers, chocolate, takeaways, fizzy drinks are all vices of mine.
To help me monitor this I purchased this accu-check gadget and although the concept of taking my own blood samples was a bit daunting, it really is very easy indeed.
Unfortunately, in the last days my blood glucose monitor seems to give incorrect readings, I tried several times turning it off and on again, but it still doesn't work.
 

BEGIN and END

A more comprehensive way of defining a whole block would be to find its beginning and its end. Consider the following syntax:

SCOPE scopeOption
{
    SEGMENT(segmentName|BEGIN)
    {
        condition
    }

    SEGMENT(segmentName|END)
    {
        condition
    }
}

The user decides where a segment begins and where it ends by defining at least two rules per segment: one with the keyword BEGIN and one with the keyword END after the segment name.

In the following sample text, the re-insured sum starts with the heading SUM REINSURED and ends where the LIMITS section starts.

Contract of Reinsurance
SUM REINSURED
USD 200,000,000 per occurrence (combined single limit or Damage and Business Interruption)
LIMITS
Contingent business interruption
USD 125,000
DEDUCTIBLES
Earthquake, Earth Movement or Volcanic Eruption 5% of loss amount, minimum USD 125,000 and maximum USD 425,000 combined Property Damage and Business Interruption

This can be expressed with the following rules:

SCOPE SENTENCE
{
    SEGMENT(SUM_REINSURED|BEGIN)
    {
        KEYWORD("SUM REINSURED")
    }

    SEGMENT(SUM_REINSURED|END)
    {
        KEYWORD("LIMITS")
    }
}

In this case, two points of reference create the segment: the first condition, marked as BEGIN, sets the opening boundary, while the second, marked as END, sets the closing boundary. The portion of text highlighted in yellow corresponds to the instance of the segment:


Contract of Reinsurance
SUM REINSURED
USD 200,000,000 per occurrence (combined single limit or Damage and Business Interruption)
LIMITS
Contingent business interruption
USD 125,000
DEDUCTIBLES
Earthquake, Earth Movement or Volcanic Eruption 5% of loss amount, minimum USD 125,000 and maximum USD 425,000 combined Property Damage and Business Interruption

To use segmentation rules most effectively, it is important that they are set up to identify concepts that often recur in the set of documents to be processed for a given project. With the exception of sporadic special cases, where the beginnings and the endings of segments can be identified with almost ad hoc rules, a good set of segmentation rules must be in some way predictive, so that they can also encompass variants of known forms and layouts.

Note

If multiple instances of the same segment overlap, they are merged into a single, larger instance.

BEFORE and AFTER

Advanced segmentation syntax allows the developer to single out phraseology that precedes or follows the segment to be detected by using the keywords BEFORE or AFTER as follows:

SCOPE scopeOption
{
    SEGMENT(segmentName|BEGIN_option) 
    {
        condition
    }

    SEGMENT(segmentName|END_option) 
    {
        condition
    }
}

where BEGIN_option and END_option correspond to one of the following conditions:

  • BEGIN_BEFORE: the segment begins with the sentence before the sentence matched by the linguistic condition.
  • BEGIN_AFTER: the segment begins with the sentence after the sentence matched by the linguistic condition.
  • END_BEFORE: the segment ends with the sentence before the sentence matched by the linguistic condition.
  • END_AFTER: the segment ends with the sentence after the sentence matched by the linguistic condition.
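
As an illustration, the reinsurance sample from the BEGIN and END section could be segmented so that the headings themselves are left out. This is only a sketch: it assumes that each line of the sample is treated as a separate sentence, and it reuses the SUM_REINSURED segment name from the earlier example.

SCOPE SENTENCE
{
    SEGMENT(SUM_REINSURED|BEGIN_AFTER)
    {
        KEYWORD("SUM REINSURED")
    }

    SEGMENT(SUM_REINSURED|END_BEFORE)
    {
        KEYWORD("LIMITS")
    }
}

Under these assumptions, the segment would begin with the sentence after the SUM REINSURED heading and end with the sentence before the LIMITS heading, so it would contain only the line stating the reinsured amount.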

Segmentation rules score

When working with segments, it is possible to define several rules for each boundary, as the number of opening and closing conditions may vary according to the type of document. Some concepts identified by segmentation rules are stronger points of reference for a segment's boundaries than others. This difference can be expressed in the rules by marking some concepts as more relevant and others as less relevant. To do so, add a score option to the rules using the following syntax:

SCOPE scopeOption
{   
    SEGMENT(segmentName|boundaryTypeOption:scoreOption)
    {
        condition
    }
}

The name of the segment must be followed by the boundary type defined by the rule as well as one of the score options. Score options can be of two types:

  • Default score option
  • Custom score option

Default score option

Segmentation and categorization rules share the same default score options listed in the table below:

Option    Description
NORMAL    The default/implicit score option
LOW       Lower than the default
HIGH      Higher than the default

The options LOW and HIGH assign a boundary a slightly different score from the default, and they can also be used to make one boundary more or less relevant than another. When using these options, consider the following guidelines:

  • The use of the default score in most cases.
  • The use of HIGH to give emphasis to a particular rule, for example one containing a concept or combination of concepts which is not ambiguous and will certainly result in a valid boundary (e.g. the main or most frequent beginning or end of a segment).
  • The use of LOW to give less importance to a rule, for example one containing a slightly ambiguous concept which you are neither willing to exclude a priori nor willing to rely on in every case (e.g. a special-case or unusual beginning or end of a segment).
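
For instance, two opening rules for the same segment could be weighted differently. The sketch below reuses the SUM_REINSURED example; the second keyword, "REINSURED AMOUNT", is a hypothetical variant heading introduced only for illustration.

SCOPE SENTENCE
{
    SEGMENT(SUM_REINSURED|BEGIN:HIGH)
    {
        KEYWORD("SUM REINSURED")
    }

    SEGMENT(SUM_REINSURED|BEGIN:LOW)
    {
        KEYWORD("REINSURED AMOUNT")
    }
}

Here the unambiguous heading is given more weight than the less reliable variant.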

Custom score option

As with categorization rules, it is possible to create custom score options. They are defined in the config.cr file and can be shared between categorization and segmentation rules.
The syntax is:

SCORES
{
  @scoreOptionName:points,
  ...
}

For example:

SCORES
{
  @LOWER:1,
  @HIGHER:20
}

Once defined, the names of the new options can be used in the segmentation rules to allow for a wider range of rule scores.

Note

Don't use language keywords as score option names.

Scope options in segmentation rules

As with categorization and extraction rules, every segmentation rule needs a SCOPE option, which defines two things:

  • The portion of text on which a single rule or a group of rules acts.
  • The portion of text over which the segment extends.

Any of the standard or custom scope options available can be used. However, there are some restrictions specific to segmentation rules that must be detailed.

  • The SCOPE options: SENTENCE / PARAGRAPH / CLAUSE / PHRASE can always be used.
  • The SCOPE options: SECTION / SEGMENT / CLAUSE (clause_type) / PHRASE (phrase_type) can be used except in those cases where the BEGIN or END statements are used to separately define the boundaries of a segment.

Phrase and clause

PHRASE and CLAUSE scope options can be used in the cases specified above. However, they only define the portion of text in which a segmentation rule's condition has to be verified. Since a segment's extension cannot disregard sentence boundaries (for example, a segment cannot be shorter than a sentence), CLAUSE and PHRASE scope options do not determine the portion of text over which the segment extends.

Sentence and paragraph

The SCOPE options SENTENCE and PARAGRAPH can be used in all the cases described above. However, when the following syntax is used:

SCOPE PARAGRAPH|SENTENCE*n
{
    segmentationRule(s)
}

a distinction must be made between the programmed scope and the real scope of a rule, where "programmed scope" is the largest portion of text on which a rule can act, and "real scope" is the portion of text actually included in the segment.

For example, if we define a rule scope in the following way

SCOPE SENTENCE*3
{
    SEGMENT(segment_name)
    {
        //condition//
    }
}

we are declaring that the rule condition has to be verified within three consecutive sentences of the input document. Three sentences are, in fact, the maximum possible scope for the rule to be verified. The rule could also be verified in a single sentence or in two sentences, depending on where the elements specified in the condition are found. Therefore, regardless of the maximum scope declared in a rule, the real scope is determined by the portion of text that actually contains the concepts the rule looks for.
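
As a sketch of this behavior, consider the pre-diabetes sample text from the beginning of this page. The segment name LIFESTYLE is hypothetical, each line of the sample is assumed to be one sentence, and the AND operator is assumed to be available in segmentation conditions as in other rule types:

SCOPE SENTENCE*3
{
    SEGMENT(LIFESTYLE)
    {
        LEMMA("tiredness")
        AND
        LEMMA("diet")
    }
}

The programmed scope is three sentences, but "tiredness" and "diet" occur in two adjacent sentences (the second and third lines of the sample). Under these assumptions, the real scope, and therefore the segment, would cover only those two sentences.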

Section and segment

The SECTION and SEGMENT scope options have a particular meaning in segmentation rules. In categorization or extraction rules, these options are used to look for concepts in a specific portion of text. In segmentation rules, on the other hand, a rule acting within a section or a previously defined segment creates a new segment inside the section or segment specified in the rule's SCOPE. This technique serves two purposes:

  • Create nested segments.
  • Upgrade a whole section or a whole segment to a new segment.

Nested segments

Using the scope option SEGMENT it is possible to define dynamic segments within other previously created segments. The syntax is the following:

SCOPE scopeOption
{
    SEGMENT(segmentName1)
    {
        condition
    }
}

SCOPE SENTENCE IN SEGMENT(segmentName1)
{
    SEGMENT (segmentName2)
    {
        condition
    }
}

The first rule (or set of rules) defines a segment using any scope options other than SEGMENT. The second rule uses the first segment as scope in order to define, within the first segment itself, another segment, nested in the first one.
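
As a sketch based on the reinsurance sample from the BEGIN and END section, the following rules would first build the SUM_REINSURED segment and then look for a nested segment inside it. The nested segment name AMOUNT and its condition are hypothetical.

SCOPE SENTENCE
{
    SEGMENT(SUM_REINSURED|BEGIN)
    {
        KEYWORD("SUM REINSURED")
    }

    SEGMENT(SUM_REINSURED|END)
    {
        KEYWORD("LIMITS")
    }
}

SCOPE SENTENCE IN SEGMENT(SUM_REINSURED)
{
    SEGMENT(AMOUNT)
    {
        KEYWORD("USD")
    }
}

Because the second rule is evaluated only inside SUM_REINSURED, the "USD 125,000" amounts that appear later in the sample, outside that segment, would not produce an AMOUNT instance.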

Circular References

When defining nested segments, it is essential to avoid circular references. Should one occur, the software will be unable to determine the correct order of the segmentation rules, making it impossible to execute them.

Consider the following examples:

SCOPE SENTENCE
{
    SEGMENT(segment1)
    {
        //condition//
    }
}

SCOPE SENTENCE IN SEGMENT(segment1)
{
    SEGMENT(segment2)
    {
        //condition//
    }
}

SCOPE SENTENCE IN SEGMENT(segment2)
{
    SEGMENT(segment3)
    {
        //condition//
    }
}

SCOPE SENTENCE IN SEGMENT(segment3)
{
    SEGMENT(segment1)
    {
        //condition//
    }
}

The rules above define:

  • Segment1 first.
  • Then segment2 is defined within segment1.
  • Then segment3 is defined within segment2.
  • At the end, segment1 is defined within segment3.

The last rule invalidates the whole set because it introduces a circular reference in the code. This would generate an error and no rule would be compiled and applied.

Sections and segments promotion

By using segmentation rules it is possible to promote a whole section or segment to a new segment which coincides with the original section or segment. In other words, it is possible to generate a segment identical in position and extension to another segment or section in order to create a sort of “duplicate” of an existing segment or section. This technique is useful when different operations must be performed within a single section or segment (linguistic rules, filters, post-processing…) and the developer needs to differentiate a document portion where these actions need to be performed. This can be achieved only when the new segment includes the entire original section or segment, not just a part of it.

For example, the following sample rule:

SCOPE SECTION(HEADLINE)
{
    SEGMENT(BOLD)
    {
        //condition//
    }
}

is correct and accepted because the entire HEADLINE section is going to be part of the new segment BOLD.