PATTERN attribute overview

Syntax

The PATTERN attribute matches the text of one or more consecutive tokens by means of regular expressions.

The syntax is:

PATTERN("regularExpression1"[, "regularExpression2", ...])

where:

PATTERN is the attribute name and must be written in uppercase.
regularExpression# is a regular expression, which must be enclosed in quotation marks.

Behavior

The following rules determine the behavior of a PATTERN attribute:

The attribute will be true only if the text it matches completely covers the text of one or more consecutive tokens.
Regular expressions can span consecutive tokens within the rule's scope.
All instances of regular expressions are matched, and the one matching the highest number of tokens is chosen.
If a single regular expression contains alternatives and one alternative is matched, the subsequent alternatives are ignored.

For example, consider these two categorization rules:

SCOPE SENTENCE
{
    DOMAIN(housing)
    {
        PATTERN("hous(e|ed|es)")
    }
}

SCOPE SENTENCE
{
    DOMAIN(housing)
    {
        PATTERN("hous(es|ed|e)")
    }
}

The two PATTERN attributes in the rules seem equivalent, since both match hous followed by one of e, ed or es, but they do not produce the same effect.

If the text is:

house
housed
houses

The regular expression in the first rule:

hous(e|ed|es)

matches text in all three lines, but activates the rule for the first line only.
This is because the first alternative (hous + e) always matches, so the other two alternatives are ignored. The pattern completely covers line 1, so the rule triggers. In lines 2 and 3 the match leaves the last character uncovered and is therefore only partial: the PATTERN attribute is false, the rule's condition is false, and the rule does not trigger.

The sequence of operations is the following:

First line (house)
- Does hous + e match? YES! → ignore subsequent alternatives, the match is full, the PATTERN attribute is true, the condition is true, the rule is activated.
Second line (housed)
- Does hous + e match? YES! → ignore subsequent alternatives, the match is partial, the PATTERN attribute is false, the condition is false, the rule is not activated.
Third line (houses)
- Does hous + e match? YES! → ignore subsequent alternatives, the match is partial, the PATTERN attribute is false, the condition is false, the rule is not activated.

The regular expression in the second rule:

hous(es|ed|e)

matches all lines and activates the rule every time.

The sequence of operations is the following:

First line (house)
- Does hous + es match? NO.
- Does hous + ed match? NO.
- Does hous + e match? YES! → no subsequent alternatives to ignore, the match is full, the PATTERN attribute is true, the condition is true, the rule is activated.
Second line (housed)
- Does hous + es match? NO.
- Does hous + ed match? YES! → ignore subsequent alternatives, the match is full, the PATTERN attribute is true, the condition is true, the rule is activated.
Third line (houses)
- Does hous + es match? YES! → ignore subsequent alternatives, the match is full, the PATTERN attribute is true, the condition is true, the rule is activated.
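
The two walkthroughs above can be approximated outside the rules engine. The following is a minimal sketch in Python, not the engine's implementation: it assumes that each line is a single token and that Python's re.match, which also tries alternatives from left to right and keeps the first one that succeeds, behaves like the matching described here. The helper name pattern_is_true is hypothetical.

```python
import re

def pattern_is_true(regex: str, token: str) -> bool:
    """Hypothetical helper: True only if the first successful match
    covers the whole token, mimicking the behavior described above."""
    m = re.match(regex, token)  # keeps the first alternative that succeeds
    # A match that stops before the end of the token is only partial.
    return m is not None and m.end() == len(token)

tokens = ("house", "housed", "houses")

for regex in (r"hous(e|ed|es)", r"hous(es|ed|e)"):
    # Evaluate each pattern against each one-token line.
    results = {t: pattern_is_true(regex, t) for t in tokens}
    print(regex, results)
```

Under these assumptions, the first pattern is true for house only, while the second is true for all three tokens, which is why listing the longer alternatives first is the safer choice.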

PATTERN scope

If you use the PATTERN attribute in combination with other attributes, be aware that it is evaluated against the rule scope, not only against the tokens matched by the other attributes.

For example, if you want to extract email addresses ending with com, with this rule:

SCOPE SENTENCE
{
    IDENTIFY(PERSONAL_DATA)
    {
        @EMAIL[TYPE(MAI) + PATTERN("(?i)^(.*(com).*)$")]
    }
}

applied to this text:

My email address is [email protected] and the company I work for is expert.ai.

the email address is extracted anyway, even though it does not end with com. The regular expression is evaluated against the rule scope (SENTENCE in this case), so it matches because the word company contains the substring com.
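
The effect can be reproduced with a plain regular expression test. This is a rough Python sketch, assuming the pattern is checked against the text of the whole sentence; the address jane.doe@example.org is a hypothetical stand-in for the one in the example and, like it, does not end with com.

```python
import re

# Hypothetical address standing in for the one in the example text.
email_token = "jane.doe@example.org"
sentence = (
    "My email address is " + email_token
    + " and the company I work for is expert.ai."
)

loose = r"(?i)^(.*(com).*)$"

# Tested against the whole sentence, the pattern succeeds because of "company".
matches_sentence = re.search(loose, sentence) is not None

# Tested against the email address alone, it does not.
matches_email_only = re.search(loose, email_token) is not None

print(matches_sentence, matches_email_only)
```

The sentence-level test succeeds even though the address itself contains no com, which mirrors the unwanted extraction described above.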

If you instead use a stricter regular expression that matches only email addresses, as in this rule:

SCOPE SENTENCE
{
    IDENTIFY(PERSONAL_DATA)
    {
        @EMAIL[TYPE(MAI) + PATTERN("[\w\-\._]+@[\w\-\._]+\.com")]
    }
}

applied to the text above, no email address is extracted, because the regular expression matches only strings shaped like email addresses ending with .com, and the address in the text does not end with com.
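
As a quick check of the stricter expression, the Python sketch below runs it against two sentences. Both addresses are hypothetical, and the sketch only exercises the regular expression itself, not the rules engine.

```python
import re

strict = r"[\w\-\._]+@[\w\-\._]+\.com"

# Hypothetical sentences: the first address does not end with com, the second does.
not_com = "My email address is jane.doe@example.org and the company I work for is expert.ai."
is_com = "Write to jane.doe@example.com today."

no_match = re.search(strict, not_com)   # "company" alone no longer satisfies the pattern
com_match = re.search(strict, is_com)   # only a full address ending with .com matches

print(no_match)
print(com_match.group(0) if com_match else None)
```

Here the word company no longer triggers a match, and only the address ending with .com is found.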