Segmentation rules

Overview

Each segment has a distinctive name and can have from zero to many instances in a document. For example, a segment named CHAPTER_INTRO, containing the introduction to the chapter, can have multiple instances in a book, one for each chapter.

A segment instance is composed of one or more consecutive text subdivisions no smaller than a sentence.

Segmentation rules produce potential segment instances or boundaries of potential instances.

Rules also assign a score to potential segment instances and boundaries. The default score is 10.

Actual segment instances are determined by a selection algorithm using the information produced by the rules.

There are two types of segmentation rules:

Instance generating rules
Boundary generating rules

Rule types are described below.

Segmentation rules are evaluated after tagging rules and before categorization and extraction rules.

This means that segmentation rules can make reference to tag instances, but tagging rules cannot reference segments.

It is possible to use segment based scopes in segmentation rules. By so doing, rules that reference segments are evaluated after the other rules. This allows segment nesting, a technique by which you can have segment instances "inside" other segment instances. You cannot create circular references, that is use a segment as the scope of rules that generate instances or boundaries of the same segment.

Declaration

In order to be referenced in segmentation rules, every segment must first be declared in a rules' file. In a new project, the config.cr file contains the following declaration of two default segments:

SEGMENTS
{
    @SEGMENT1,
    @SEGMENT2
}

It's up to you to use, ignore or discard those predefined segments.

The syntax of the declaration is:

SEGMENTS
{
    @segmentName[,
    @segmentName ...]
}

Instance generating rules

Instance generating rules produce potential segment instances.
For example, consider the following sample text:

About a month ago I was diagnosed with "pre-diabetes" after a blood test.
I had complained to my doctor of constant tiredness and lack of energy throughout the day.
I will be the first to admit my diet is lousy - pizzas, burgers, chocolate, takeaways, fizzy drinks are all vices of mine.
To help me monitor this I purchased this gadget and although the concept of taking my own blood samples was a bit daunting, it really is very easy indeed.
Unfortunately, in the last days my blood glucose monitor seems to give incorrect readings, I tried several times turning it off and on again, but it still doesn't work.

The following segmentation rule:

SCOPE SENTENCE
{
    SEGMENT(MALFUNCTION)
    {
        KEYWORD("n't","not")
        >
        LEMMA ("work")
    }
}

will be triggered by the lemma work preceded by a negation and will produce a potential instance of segment MALFUNCTION coinciding with the sentence in which the condition was met, as highlighted below:

About a month ago I was diagnosed with "pre-diabetes" after a blood test.
I had complained to my doctor of constant tiredness and lack of energy throughout the day.
I will be the first to admit my diet is lousy - pizzas, burgers, chocolate, takeaways, fizzy drinks are all vices of mine.
To help me monitor this I purchased this gadget and although the concept of taking my own blood samples was a bit daunting, it really is very easy indeed.
Unfortunately, in the last days my blood glucose monitor seems to give incorrect readings, I tried several times turning it off and on again, but it still doesn't work.

The syntax of instance producing rules is:

SEGMENT[[ruleLabel]](segmentName[:score])
{
    condition
}

where:

segmentName is the name of the segment.
ruleLabel is a label that helps identify the rule.
score is optional. If set, it determines the score of the potential segment instance, which is an information that can be used by the selection algorithm. See below the possible values of this option.

Whenever the condition is satisfied, a potential instance of the segment if generated.

The extension of the instance depends on the scope of the rule and the places inside the scope where the condition is matched:

If the scope of the rule is a section or a segment, the instance coincides with the scope, so it's an entire occurrence of a section or of a segment.
If the scope is one or multiple paragraphs, the instance will start with the paragraph in which the condition has its first match and end with the paragraph in which the condition has its last match.
If the scope is one or multiple sentences, the instance will start with the paragraph in which the condition has its first match and end with the paragraph in which the condition has its last match.
If the scope is smaller than a sentence, the instance will coincide with the sentence in which the condition is satisfied.

For example, this rule:

SCOPE PARAGRAPH*10
{
    SEGMENT(CORRELATION)
    {
        LEMMA("prediabetes")
        AND
        SYNCON(100154383,78173)//@SYN: #100154383# [take-away] //@SYN: #78173# [fizzy drink]
    }
}

run on the same sample text above, will produce the multi-paragraph potential instance of the CORRELATION segment highlighted below:

About a month ago I was diagnosed with "pre-diabetes" after a blood test.
I had complained to my doctor of constant tiredness and lack of energy throughout the day.
I will be the first to admit my diet is lousy - pizzas, burgers, chocolate, takeaways, fizzy drinks are all vices of mine.
To help me monitor this I purchased this gadget and although the concept of taking my own blood samples was a bit daunting, it really is very easy indeed.
Unfortunately, in the last days my blood glucose monitor seems to give incorrect readings, I tried several times turning it off and on again, but it still doesn't work.

The condition has to match in a ten paragraphs scope. It has its first match (pre-diabetes) in the first paragraph and its last matches (takeaways and fizzy drinks) in the third paragraph, so the potential segment instance will span from the beginning of the first paragraph to the end of the third.

Boundary generating rules

A boundary generating rule produces either a begin boundary or an end boundary.
A begin boundary is a position in the text where a potential segment instance can begin, an end boundary is a position where an instance can end.

For example, consider this excerpt of an insurance contract:

Contract of Reinsurance
SUM REINSURED
USD 200,000,000 per occurrence (combined single limit or Damage and Business Interruption)
LIMITS
Contingent business interruption
USD 125,000
DEDUCTIBLES
Earthquake, Earth Movement or Volcanic Eruption 5% of loss amount, minimum USD 125,000 and maximum USD 425,000 combined Property Damage and Business Interruption

The re-insured sum can be found between SUM REINSURED and LIMITS.
Consider the following rules:

SCOPE SENTENCE
{
    SEGMENT(SUM_REINSURED|END)
    {
        KEYWORD("LIMITS")
    }
}

SCOPE SENTENCE
{
    SEGMENT(SUM_REINSURED|END)
    {
        KEYWORD("SUM REINSURED")
    }
}

The first rule produces and end boundary for the SUM_REINSURED segment, the second rules produces a begin boundary for the same segment. The end boundary is at the end of the sentence where the condition of the first rule has its match, while the begin boundary is at the beginning of the sentence where the condition of the second rule has its match.

Having a begin boundary and an end boundary for the same segment, the selection algorithm can produce the potential segment instance highlighted below, covering all the text between the two boundaries.

Contract of Reinsurance
SUM REINSURED
USD 200,000,000 per occurrence (combined single limit or Damage and Business Interruption)
LIMITS
Contingent business interruption
USD 125,000
DEDUCTIBLES
Earthquake, Earth Movement or Volcanic Eruption 5% of loss amount, minimum USD 125,000 and maximum USD 425,000 combined Property Damage and Business Interruption

The syntax of boundary generating rules is:

SEGMENT(segmentName|boundaryOption[:score])
{
    condition
}

boundaryOption determines the type of boundary and a possible offset of the boundary with respect to the text matched by the condition.

score is optional. If set, it determines the score of the potential segment instance, which is an information that can be used by the selection algorithm. See below the possible values of this option.

The possible values of boundaryOption are BEGIN, BEGIN_BEFORE, BEGIN_AFTER, END, END_BEFORE, END_AFTER.
The values starting with BEGIN cause the generation of a begin boundary, those with a value starting with END cause the generation of an end boundary.

A boundary is generated in three steps:

An ephemeral segment instance is generated interpreting the rule exactly like an instance generating rule (see above).
The base position of the boundary is computed like this:
- For a begin boundary: the start of the ephemeral segment instance.
- For a end boundary: the end of the ephemeral segment instance.
If boundaryOption ends with BEFORE or AFTER, the final position is computed considering an offset from the base position, otherwise the final position coincides with the base position.
The offset is a function of the scope and of boundaryOption:
- If the scope of the rule is a section or a segment, the final position of the boundary will be moved to the beginning or end of the previous or next occurrence of the section or segment.
- If the scope is one or multiple paragraphs, the final position of the boundary will be moved to the beginning or end of the previous or next paragraph.
- Otherwise, the final position of the boundary will be moved to the beginning or end of the previous or next sentence.
The position is moved to the previous occurrence of the text subdivision—section, segment, pargraph or sentence—if boundaryOption ends with BEFORE, to the next occurrence if boundaryOption ends with AFTER. The final position will be at the beginning of the text subdivision for a begin boundary and at the end for an end boundary.

The offset is applied only if the destination text subdivision is available, otherwise the boundary position is not changed.

For example, consider a text with nine paragraphs and boundary generating rules with scope PARAGRAPH.
If the condition of a rule generating a begin boundary is matched in the third paragraph, the position of the boundary will be:

The beginning of the third paragraph itself if boundaryOption is BEGIN.
The beginning of the previous (second) paragraph if boundaryOption is BEGIN_BEFORE.
The beginning of the next (fourth) paragraph if boundaryOption is BEGIN_AFTER.

If the condition of a rule generating an end boundary is matched in the sixth paragraph, the position of the boundary will be:

The end of the sixth paragraph itself if boundaryOption is END.
The end of the previous (fifth) paragraph if boundaryOption is END_BEFORE.
The end of the next (seventh) paragraph if boundaryOption is END_AFTER.

Score option

The scoring option allows you to change the default score of potential segment instances and boundaries. The option can be set to one of the standard values described below or to a custom score.

Option value	Score
`NORMAL`	10
`LOW`	3
`HIGH`	15

Score is used by the selection algorithm.