PATTERN attribute overview
Syntax
The PATTERN attribute matches the text of one or more consecutive tokens by means of regular expressions.
The syntax is:
PATTERN("regularExpression1"[, "regularExpression2", ...])
where:
- PATTERN is the attribute name and must be written in uppercase.
- regularExpression# refers to a regular expression, which must be written in quotation marks.
Behavior
The following rules determine the behavior of a PATTERN attribute:
- The attribute will be true only if the text it matches completely covers the text of one or more consecutive tokens.
- Regular expressions can span consecutive tokens within the rule's scope.
- All instances of regular expressions are matched, and the one matching the highest number of tokens is chosen.
- If a single regular expression contains alternatives and one alternative is matched, the subsequent alternatives are ignored.
For example, consider these two categorization rules:
SCOPE SENTENCE
{
    DOMAIN(housing)
    {
        PATTERN("hous(e|ed|es)")
    }
}

SCOPE SENTENCE
{
    DOMAIN(housing)
    {
        PATTERN("hous(es|ed|e)")
    }
}
The two PATTERN attributes in the rules seem equivalent (hous followed by one of e, ed or es), but they do not produce the same effect.
If the text is:
house
housed
houses
The regular expression in the first rule:
hous(e|ed|es)
matches all the lines, but activates the rule in the first line only.
This is because the first alternative (hous + e) always matches, so the other two alternatives are ignored. The pattern completely covers line 1, so the attribute is true there. On lines 2 and 3 it leaves the last character uncovered, so the match is only partial: the PATTERN attribute is false, the rule's condition is false and the rule is not triggered.
The sequence of operations is the following:
- First line (house)
  - Does hous + e match? YES! → ignore subsequent alternatives, the match is full, the PATTERN attribute is true, the condition is true, the rule is activated.
- Second line (housed)
  - Does hous + e match? YES! → ignore subsequent alternatives, the match is partial, the PATTERN attribute is false, the condition is false, the rule is not activated.
- Third line (houses)
  - Does hous + e match? YES! → ignore subsequent alternatives, the match is partial, the PATTERN attribute is false, the condition is false, the rule is not activated.
The regular expression in the second rule:
hous(es|ed|e)
matches all lines and activates the rule every time.
The sequence of operations is the following:
- First line (house)
  - Does hous + es match? NO.
  - Does hous + ed match? NO.
  - Does hous + e match? YES! → no subsequent alternatives to ignore, the match is full, the PATTERN attribute is true, the condition is true, the rule is activated.
- Second line (housed)
  - Does hous + es match? NO.
  - Does hous + ed match? YES! → ignore subsequent alternatives, the match is full, the PATTERN attribute is true, the condition is true, the rule is activated.
- Third line (houses)
  - Does hous + es match? YES! → ignore subsequent alternatives, the match is full, the PATTERN attribute is true, the condition is true, the rule is activated.
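The difference between the two orderings can be reproduced outside the rules language. The sketch below uses Python's re module as a stand-in: it is not the engine that evaluates PATTERN, but it also tries alternatives from left to right, which is enough to illustrate the point. The helper name pattern_is_true is hypothetical, and each line of the sample text is treated as a single token.

```python
import re


def pattern_is_true(regex: str, token_text: str) -> bool:
    """Hypothetical helper: the first alternative that matches wins,
    then the match must cover the whole token text to count as true."""
    match = re.match(regex, token_text)  # alternatives are tried left to right
    return match is not None and match.end() == len(token_text)


lines = ["house", "housed", "houses"]

for regex in (r"hous(e|ed|es)", r"hous(es|ed|e)"):
    results = [pattern_is_true(regex, text) for text in lines]
    print(regex, results)
# hous(e|ed|es) [True, False, False]
# hous(es|ed|e) [True, True, True]
```

Listing the longest alternative first, as in the second expression, is the usual way to avoid a partial match when one alternative is a prefix of another.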
PATTERN scope
If you use the PATTERN attribute in combination with other attributes, be aware that it acts on the rule scope.
For example, if you want to extract email addresses ending with com, with this rule:
SCOPE SENTENCE
{
    IDENTIFY(PERSONAL_DATA)
    {
        @EMAIL[TYPE(MAI) + PATTERN("(?i)^(.*(com).*)$")]
    }
}
applied to this text:
My email address is [email protected] and the company I work for is expert.ai.
the email address will be extracted anyway, even though it does not end with com, because the regular expression acts on the rule scope (SENTENCE in this case) and is matched because of the word company, which contains the substring com.
If you instead use a stricter regular expression that matches email addresses only, as in this rule:
SCOPE SENTENCE
{
    IDENTIFY(PERSONAL_DATA)
    {
        @EMAIL[TYPE(MAI) + PATTERN("[\w\-\._]+@[\w\-\._]+\.com")]
    }
}
applied to the text above, no email address will be extracted, because the regular expression only matches strings that look like email addresses ending with com.
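The effect of the scope can be checked with a short sketch that runs both regular expressions against the whole sentence. It uses Python's re module as a stand-in for the rules engine, so it only illustrates the regular expressions themselves. The address jane.doe@example.org is a hypothetical placeholder, chosen because it does not end with com.

```python
import re

# Hypothetical address used for illustration only; it does not end with "com".
email = "jane.doe@example.org"
sentence = f"My email address is {email} and the company I work for is expert.ai."

loose = r"(?i)^(.*(com).*)$"            # any sentence containing "com"
strict = r"[\w\-\._]+@[\w\-\._]+\.com"  # an address-like string ending in ".com"

loose_hit = re.search(loose, sentence)
strict_hit = re.search(strict, sentence)

# The loose expression succeeds, but only thanks to the word "company".
start = loose_hit.start(2)
print(sentence[start:start + 7])  # company

# The strict expression finds nothing in the sentence.
print(strict_hit)  # None
```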