Flexible sequence

Unconstrained range

The flexible sequence operator (less-than sign followed by greater-than sign, <>) requires that the two tokens matched by the operands on its sides appear in that order within the same sentence, regardless of how many tokens separate them.

All types of sequences can act at either the atom level or the token level of a sentence, according to the attribute after them.

The syntax is:

operand1
<>
operand2

Consider the following example:

SCOPE SENTENCE
{
    IDENTIFY (TEST)
    {
        @company[ANCESTOR(37475) + TYPE(NPR) + ROLE(SUBJECT)]//  37475: company, enterprise, firm, house,
        >
        LEMMA("produce", "design") + TYPE(VER)
        <>
        @product[ANCESTOR(78687)]//  78687: artifact, artefact
    }
}

This extraction rule is meant to extract proper names of companies and the products they manufacture.
The first operand matches any concept descending from syncon 37475 (company), but is limited to proper nouns (+TYPE(NPR)), thus excluding any common noun like limited liability company. The operand is further restricted so that only proper nouns playing the role of subject in a sentence or a clause (+ROLE(SUBJECT)) are matched.
The second operand matches the inflections of the verbs produce and design, and the third matches the concepts descending from syncon 78687 (artifact).
A loose sequence operator is used to combine the first and the second operand while the flexible sequence operator is used between the second and the third.

If the rule is run against the following sample text:

The company Prada produces high-end ready-to-wear clothes for men and women. In addition Prada designs a range of children's clothes, fragrances, cosmetic products and accessories for men and women, including handbags, shoes, wallets and sunglasses.

it is triggered several times.
In the first sentence Prada is matched by the first operand, produces is matched by the second and clothes by the third. The rule is triggered because the tokens are found in the expected order.
The second sentence activates the rule multiple times because of Prada, designs and all the "artifacts" (children's clothes, fragrances, products, handbags, shoes etc.); the distance from the previous token doesn't matter since they are in the same sentence.
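
The repeated activations in the second sentence can be modelled outside the rules language. The Python sketch below is only an illustration under stated assumptions, not the engine's implementation: the sentence is pre-tokenized by hand, and `flexible_pairs`, `VERB_FORMS` and `ARTIFACTS` are hypothetical stand-ins for the operator and for the second and third operands.

```python
from typing import Callable, List, Tuple


def flexible_pairs(
    tokens: List[str],
    left: Callable[[str], bool],
    right: Callable[[str], bool],
) -> List[Tuple[str, str]]:
    """Return every (left, right) pair in which the right token follows the
    left token in the same sentence, at any distance."""
    pairs = []
    for i, first in enumerate(tokens):
        if not left(first):
            continue
        for second in tokens[i + 1:]:
            if right(second):
                pairs.append((first, second))
    return pairs


# Hypothetical stand-ins for LEMMA("produce", "design") and ANCESTOR(78687);
# only the artifacts named in the text are listed.
VERB_FORMS = {"produces", "designs"}
ARTIFACTS = {"clothes", "fragrances", "products", "handbags", "shoes"}

# One sentence, tokenized by hand (punctuation kept as separate tokens).
sentence = (
    "In addition Prada designs a range of children's clothes , fragrances , "
    "cosmetic products and accessories for men and women , including "
    "handbags , shoes , wallets and sunglasses ."
).split()

matches = flexible_pairs(
    sentence,
    lambda t: t in VERB_FORMS,
    lambda t: t in ARTIFACTS,
)
for verb, artifact in matches:
    print(verb, "<>", artifact)
```

Each printed pair corresponds to one activation: the same verb is paired with every artifact that follows it in the sentence, however far away it is.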

Constrained range

The flexible sequence operator accepts a parameter to specify the minimum required distance and the maximum allowed distance between the tokens corresponding to the combined operands.

The parameter syntax is:

<min:>

or:

<:max>

or:

<min:max>

where min and max are two integer numbers indicating, respectively, the minimum required distance and the maximum allowed distance.

When both min and max are specified, min can be negative. The distance is measured in words; punctuation marks and other symbols indicating the end of a sentence (for example periods, exclamation marks and question marks) are counted as well.
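
The three parameter forms can be read as a pair of optional bounds. The following Python sketch is a model under assumptions, not the engine's parser: `parse_range` and `in_range` are hypothetical names, the distance is taken as already computed, and a missing minimum defaults to 1 because the documentation treats <:max> as equivalent to <1:max>.

```python
from typing import Optional, Tuple


def parse_range(spec: str) -> Tuple[Optional[int], Optional[int]]:
    """Split '<min:max>', '<min:>' or '<:max>' into (min, max).
    None means the bound was not given."""
    inner = spec.strip()[1:-1]  # drop the angle brackets
    low, high = inner.split(":")
    return (int(low) if low else None, int(high) if high else None)


def in_range(distance: int, spec: str) -> bool:
    """Check a precomputed distance against a range parameter."""
    low, high = parse_range(spec)
    if low is None:
        low = 1  # assumption taken from the text: <:max> behaves like <1:max>
    if distance < low:
        return False
    # A missing maximum means there is no upper bound.
    return high is None or distance <= high


for spec, distance in [("<1:3>", 3), ("<1:3>", 9), ("<-3:3>", -1), ("<2:>", 12), ("<:3>", 0)]:
    print(spec, distance, in_range(distance, spec))
```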

Consider the following example:

SCOPE SENTENCE
{
    DOMAIN(dom1:NORMAL)
    {
        TYPE(NPH)
        <1:3>
        ANCESTOR(71230)//# 71230: arrest, apprehend, seize, sneeze, slough, take in, nab, collar, pick up, cop, nail
    }
}

The condition of this categorization rule matches any person's name followed, at a distance of one to three words inside the same sentence, by any expression of a concept that descends from syncon 71230 (to arrest).

If the rule is run against the following sample text:

Cook County Sheriff Tammy Wheeler informed the press that Chris Collins was finally arrested today by Rhode Island State Police with domestic disorderly conduct and assault.

Chris Collins is matched by the first operand and was (...) arrested is matched by the second. The distance between the two tokens is three words:

1. was
2. finally
3. arrested

so the rule is triggered. By contrast, Tammy Wheeler and was (...) arrested, which are also matched by the two operands, do not trigger the rule because the distance between the matched tokens (nine words) exceeds the allowed maximum.
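
The two distances can be checked with a few lines of Python. This is a sketch under assumptions, not the engine's algorithm: the text is split on whitespace, and the distance is counted from the last word of the name to the word arrested, which is how the count above is laid out. The helper name `word_distance` is hypothetical.

```python
tokens = (
    "Cook County Sheriff Tammy Wheeler informed the press that "
    "Chris Collins was finally arrested today by Rhode Island State Police "
    "with domestic disorderly conduct and assault."
).split()


def word_distance(tokens, last_word_of_first, second_word):
    """Number of steps from the end of the first match to the second match."""
    return tokens.index(second_word) - tokens.index(last_word_of_first)


collins = word_distance(tokens, "Collins", "arrested")
wheeler = word_distance(tokens, "Wheeler", "arrested")

print("Chris Collins -> arrested:", collins, 1 <= collins <= 3)
print("Tammy Wheeler -> arrested:", wheeler, 1 <= wheeler <= 3)
```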

Now, consider the following sample text:

Cook County Sheriff Tammy Wheeler informed the press that today Rhode Island State Police finally arrested Chris Collins with domestic disorderly conduct and assault.

In this case the rule is not triggered because arrested precedes Chris Collins.
If the range is changed this way:

SCOPE SENTENCE
{
    DOMAIN(dom1:NORMAL)
    {
        TYPE(NPH)
        <-3:3>
        ANCESTOR(71230)//# 71230: arrest, apprehend, seize, sneeze, slough, take in, nab, collar, pick up, cop, nail
    }
}

the condition is met again because arrested has distance -1 from Chris Collins.
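
A signed distance makes the effect of the negative minimum visible. The Python sketch below is a model under assumptions rather than the engine's definition: matches are given as (start, end) word positions, and `signed_distance` is a hypothetical helper that measures from the nearest edge of the first match, returning a negative value when the second match comes first.

```python
from typing import Tuple

Span = Tuple[int, int]  # (index of first word, index of last word)

tokens = (
    "Cook County Sheriff Tammy Wheeler informed the press that today "
    "Rhode Island State Police finally arrested Chris Collins with "
    "domestic disorderly conduct and assault."
).split()


def signed_distance(first: Span, second: Span) -> int:
    """Positive when `second` follows `first`, negative when it precedes it."""
    if second[0] > first[1]:
        return second[0] - first[1]
    return second[1] - first[0]


chris_collins = (tokens.index("Chris"), tokens.index("Collins"))
arrested = (tokens.index("arrested"), tokens.index("arrested"))

d = signed_distance(chris_collins, arrested)
print("distance:", d)
print("within <1:3>: ", 1 <= d <= 3)
print("within <-3:3>:", -3 <= d <= 3)
```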

Warning

Do not combine negative attributes with a flexible sequence operator having a negative range. If you have to use negation, write two separate rules.

Note

Range <1:1> is equivalent to the strict sequence operator (>>).

The <:max> syntax is equivalent to <1:max>; the second form is preferred because it is easier to read.

It is also possible to have <0:max>.
Zero is the conventional distance between two words that, together, are an inflection of a given compound lemma (for example state police). Distance zero is useful when it is convenient to match compound lemmas together with possible variants that are not compound lemmas. Consider the following rule:

SCOPE SENTENCE
{
    DOMAIN(dom1:NORMAL)
    {
        KEYWORD("state")
        <0:3>
        KEYWORD("police", "trooper", "troopers")
    }
}

The condition states that the keyword state, defined in the first attribute, has to precede a token matched by one of the keywords in the second attribute (police, trooper, troopers). The rule is triggered only if the two elements are in the same sentence and are separated by a distance of 0 to 3 words. The minimum of 0 allows the sequence to be valid even if the two tokens belong to the same lemma.

If the rule above is run against the following text:

New details are emerging as police continue to investigate a shooting spree that left several people dead and three state police troopers injured Friday morning in Central Pennsylvania. The incident started at a small church in Frankstown Township in Blair County, which is near Altoona. When it was over, three state troopers had to be hospitalized for injuries.
"I think we have three very fortunate state police members tonight", said Lt. Col. George Bivens, of the Pennsylvania State Police.

the rule is activated by:

  • First sentence, state police
  • First sentence, state (...) troopers
  • Third sentence, state troopers
  • Fourth sentence, state police

In particular, in the first sentence, state police troopers triggers the rule twice: once because of state + police and once because of state + troopers. Thanks to the minimum distance of zero, the rule is triggered even if the disambiguator recognizes state police as a single lemma.
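
The role of the zero minimum can be sketched in Python. This is a simplified model, not the disambiguator's behaviour: each word carries a hypothetical lemma-unit id, two words in the same unit are given distance 0 (the convention described above), and any other pair is given the difference of their word positions. The names `unit_distance` and `keyword_hits` are illustrative only.

```python
from typing import List, Set, Tuple

Token = Tuple[str, int]  # (word, id of the lemma unit the word belongs to)


def unit_distance(tokens: List[Token], i: int, j: int) -> int:
    """0 for two words of the same compound lemma, else the position gap."""
    if tokens[i][1] == tokens[j][1]:
        return 0
    return j - i


def keyword_hits(
    tokens: List[Token], first: str, seconds: Set[str], low: int, high: int
) -> List[Tuple[str, str]]:
    """Pairs (first, second) whose distance lies in [low, high]."""
    hits = []
    for i, (word, _) in enumerate(tokens):
        if word != first:
            continue
        for j in range(i + 1, len(tokens)):
            if tokens[j][0] in seconds and low <= unit_distance(tokens, i, j) <= high:
                hits.append((word, tokens[j][0]))
    return hits


# Assumption: "state police" was recognized as one compound lemma (unit 1).
phrase = [("three", 0), ("state", 1), ("police", 1), ("troopers", 2), ("injured", 3)]
KEYWORDS = {"police", "trooper", "troopers"}

print(keyword_hits(phrase, "state", KEYWORDS, 0, 3))
print(keyword_hits(phrase, "state", KEYWORDS, 1, 3))
```

In this model, raising the minimum from 0 to 1 drops the state + police pair, because both words sit inside the same compound lemma.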

The <min:> syntax sets only the minimum distance, which can be a positive or a negative number, and leaves the maximum distance unbounded.

For example, the following rule:

SCOPE SENTENCE
{
    DOMAIN(1.1)
    {
        LEMMA("police officer")
        <2:>
        LEMMA("arrest")
    }
}

when applied to the following input text:

Police officer Scott Perkins, who was on duty on the 19th of July, arrested two teenagers who were stealing from a pawn shop.

the rule is triggered, because the lemma arrest is found at a distance of at least two words (in a range from two to infinity) from the lemma police officer.

Inverted syntax: <<:n>

The <<:n> operator is an extension of the < and << syntax and is mainly used with negations.

Let's take a look at an example. If the following rule:

SCOPE SENTENCE
{
    DOMAIN(dom1)
    {
        LEMMA("meeting")
        <>
        !KEYWORD("not")
        <<:3>
        LEMMA("schedule")
    }
}

is run against the following sentences:

The meeting with the customer was scheduled.
The meeting with the customer has not been officially scheduled yet.

the rule is triggered only by the first sentence: the lemma schedule is found after the lemma meeting within the range specified by the flexible sequence (unbounded in this case) and is not preceded within 3 tokens by the keyword not. In the second sentence the keyword not occurs within the 3 tokens that precede scheduled, so the rule is not triggered.
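
The behaviour on the two sentences can be reproduced with a small Python model. It is a sketch under assumptions, not the engine's evaluation: words are lower-cased and split on whitespace, the inflected form scheduled stands in for LEMMA("schedule"), and the hypothetical function `rule_triggers` checks the `window` tokens immediately before it.

```python
def rule_triggers(sentence: str, window: int = 3) -> bool:
    """True when 'scheduled' follows 'meeting' and 'not' does not occur
    among the `window` tokens immediately before 'scheduled'."""
    tokens = sentence.rstrip(".").lower().split()
    if "meeting" not in tokens or "scheduled" not in tokens:
        return False
    meeting = tokens.index("meeting")
    scheduled = tokens.index("scheduled")
    if scheduled <= meeting:
        return False  # the flexible sequence requires meeting to come first
    preceding = tokens[max(0, scheduled - window):scheduled]
    return "not" not in preceding


first = "The meeting with the customer was scheduled."
second = "The meeting with the customer has not been officially scheduled yet."

print(rule_triggers(first))
print(rule_triggers(second))
```

In the second sentence the three tokens before scheduled are not, been and officially, so the negated operand blocks the match; with a window of 2 the same sentence would pass the check in this model.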