Groups peculiarities

The difference between match and capture

If you use the PATTERN attribute to set a field in an extraction rule, the groups used in the regular expressions can affect the value of the field.

For example, consider this extraction rule:

SCOPE SENTENCE
{
    IDENTIFY(OPENDATA)
    {
        @KNOWLEDGE_BASE[PATTERN("Wikipedia")]
    }
}

This rule sets the KNOWLEDGE_BASE field to Wikipedia if the text contains a token whose text is Wikipedia. In this case the entire matched text is captured and transferred to the field.

This slightly different rule, however:

SCOPE SENTENCE
{
    IDENTIFY(OPENDATA)
    {
        @KNOWLEDGE_BASE[PATTERN("(Wiki)pedia")]
    }
}

sets the KNOWLEDGE_BASE field to Wiki if the text contains a token whose text is Wikipedia.

In other words, the match is always determined by the entire regular expression.
The capture, however, depends on the groups: if there are no capturing groups, the entire matched text is captured and transferred to the field; if there are capturing groups, only the text matched by those groups is captured.
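
The distinction can be reproduced outside the rules language. The following is a minimal sketch that uses Python's `re` module as a stand-in for the PATTERN engine (an assumption: the two engines are not identical). `field_value` is a hypothetical helper, not part of any product API, and it only covers patterns with zero or one capturing group.

```python
import re

def field_value(pattern: str, text: str):
    """Hypothetical helper: mimic the capture rule for zero or one group."""
    m = re.search(pattern, text)
    if m is None:
        return None        # no match, so nothing is captured
    if m.re.groups == 0:
        return m.group(0)  # no capturing groups: the whole match is captured
    return m.group(1)      # one capturing group: only its text is captured

print(field_value(r"Wikipedia", "Wikipedia"))    # Wikipedia
print(field_value(r"(Wiki)pedia", "Wikipedia"))  # Wiki
```

Both patterns match the same text; only the value that would reach the field differs.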

Consider another PATTERN example:

PATTERN("\+\d+\s(\d+\s\d+)")

applied to this text:

+44 744 0963112

The attribute evaluates to true because the regular expression matches the entire text of three consecutive tokens. However, only the sub-expression between parentheses determines the capture, so if the attribute is used to set a field, the value of the field will be:

744 0963112

thus omitting the first part (+44).

Non-capturing groups affect the match, but not the capture.

For example, if this rule:

SCOPE SENTENCE
{
    IDENTIFY(OPENDATA)
    {
        @KNOWLEDGE_BASE[PATTERN("Wiki(?:pedia|data|base)")]
    }
}

is applied to this text:

Wikipedia
Wikidata
Wikibase

the KNOWLEDGE_BASE field is set three times, respectively to:

Wikipedia
Wikidata
Wikibase

since all the matched text is captured, regardless of the group (which is non-capturing).
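
As a rough illustration, again using Python's `re` module as a stand-in for the PATTERN engine (an assumption), a non-capturing group leaves the pattern with no capturing groups, so each whole match is the value captured:

```python
import re

pattern = re.compile(r"Wiki(?:pedia|data|base)")
text = "Wikipedia\nWikidata\nWikibase"

# (?:...) groups the alternatives without creating a capturing group.
print(pattern.groups)  # 0

# With no capturing groups, each whole match is what gets captured.
values = [m.group(0) for m in pattern.finditer(text)]
print(values)  # ['Wikipedia', 'Wikidata', 'Wikibase']
```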

Nested groups rule

When a capturing group contains other capturing groups (which can, recursively, contain further capturing groups), the text corresponding to the outermost group is captured.

For example, if the text is:

He was awarded the Silver Star for military valor.

and the regular expression is:

((Gold|Silver|Purple) (Star|Cross|Heart))

the overall capture is:

Silver Star

Tip

When there are groups and sub-groups and the entire match needs to be captured, surround the regular expression with parentheses.
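
The nesting can be inspected with Python's `re` module, used here only as a stand-in for the PATTERN engine (an assumption). Python exposes every group separately; the outermost group is the one whose text corresponds to the overall capture described above.

```python
import re

text = "He was awarded the Silver Star for military valor."
m = re.search(r"((Gold|Silver|Purple) (Star|Cross|Heart))", text)

# Python numbers groups by their opening parenthesis,
# so group 1 is the outermost group.
outer, first_inner, second_inner = m.group(1), m.group(2), m.group(3)
print(outer)         # Silver Star
print(first_inner)   # Silver
print(second_inner)  # Star
```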

Consecutive groups rule

If there are multiple capturing groups at the same level in a regular expression, the overall capture will be the concatenation of what is captured by all the groups, in the order in which the groups are found. A blank character is added as a separator in the concatenation.

For example, if the text is:

AH_808_BF_915

and the regular expression is:

([A-Z]{2})_[0-9]{3}_([A-Z]{2})_[0-9]{3}

the overall capture is:

AH BF
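
A minimal sketch of the same rule in Python follows. Note the assumption: Python's `re` module keeps the groups separate, so the join with a blank is written out explicitly here to imitate what the text above describes for PATTERN.

```python
import re

m = re.fullmatch(r"([A-Z]{2})_[0-9]{3}_([A-Z]{2})_[0-9]{3}", "AH_808_BF_915")

# Groups are returned in the order in which they appear in the expression.
print(m.groups())  # ('AH', 'BF')

# Imitate the concatenation rule: join the groups with a single blank.
overall = " ".join(m.groups())
print(overall)  # AH BF
```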

Repetitions

Pay attention to the use of repetitions with capturing groups.

If the text is:

XXXL

the regular expression:

(X)*L

matches the entire text, but captures only:

X

because the capturing group corresponds to a single X character only: the "zero or more occurrences" repetition (the asterisk, *) is applied to the group from the outside, so it repeats the group without widening what the group captures.
To capture all of the Xs, surround the repeated group, together with its asterisk, with parentheses to create an outer capturing group:

((X)*)L
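
The difference between the two expressions can be checked with Python's `re` module, used as a stand-in for the PATTERN engine (an assumption). In Python a repeated group retains the text of its last iteration, which here is a single X.

```python
import re

text = "XXXL"
inner_only = re.fullmatch(r"(X)*L", text)
with_outer = re.fullmatch(r"((X)*)L", text)

print(inner_only.group(0))  # XXXL  (the whole text is matched)
print(inner_only.group(1))  # X     (the group holds one X only)
print(with_outer.group(1))  # XXX   (the outer group spans every repetition)
```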

Lookahead and lookbehind groups

The lookahead and lookbehind groups, whether positive or negative, must not be used in the regular expression syntax of the PATTERN attribute. To obtain the same effect, use the capturing groups or sequence operators plus negations appropriately.

For example, to obtain the same effect as an expression like "Capture blue only if followed by sky" (positive lookahead), a lookahead group isn't necessary; just use a capturing group like this:

(blue) sky

To obtain the effect of a negative lookahead like: "Capture dark only if NOT followed by matter", use a condition like this:

@FIELDX[PATTERN("dark")]
>>
!PATTERN("matter")

Similarly, for a positive lookbehind like: "Capture star only if preceded by red", just use the capturing groups like this:

red (star)

while for a negative lookbehind like: "Capture star only if NOT preceded by cake", you can use a condition like this:

!PATTERN("cake")
>>
@FIELDX[PATTERN("star")]
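
For the two positive cases, the equivalence of the captured text can be sketched in Python, whose `re` module does support lookaround groups (unlike PATTERN, according to the text above). The sample sentence is made up for illustration.

```python
import re

text = "a blue sky and a red star"  # hypothetical sample text

# Lookaround forms: valid in Python, not to be used in PATTERN.
lookahead = re.search(r"blue(?= sky)", text).group(0)
lookbehind = re.search(r"(?<=red )star", text).group(0)

# Capturing-group forms recommended above.
group_ahead = re.search(r"(blue) sky", text).group(1)
group_behind = re.search(r"red (star)", text).group(1)

print(lookahead == group_ahead)    # True
print(lookbehind == group_behind)  # True
```

The capturing-group forms match a longer stretch of text than the lookaround forms, because the context words are part of the match, but the captured value is the same.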

Backward references

A backward reference is a reference to the text of a previously defined group.

Note

The reference is to the matched text and not to the expression itself.

A backward reference consists of the escape character \ followed by a number between 1 and 9: \1 refers to the first group, \2 to the second, and so on. For example:

(.*)-\1

matches any string which repeats itself, with a central hyphen, such as:

go-go
ha-ha
walla-walla
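
A short check in Python, used as a stand-in for the PATTERN engine (an assumption), illustrates both the example above and the point made in the note. The strings go-ha, 7-7 and 7-8 and the expression `(\d)-\1` are additional illustrative inputs, not taken from the original examples.

```python
import re

repeated = re.compile(r"(.*)-\1")
results = {w: bool(repeated.fullmatch(w))
           for w in ["go-go", "ha-ha", "walla-walla", "go-ha"]}
print(results)

# The reference is to the matched text, not to the expression:
# \1 must repeat the very digit that (\d) matched.
same_digit = re.compile(r"(\d)-\1")
print(bool(same_digit.fullmatch("7-7")))  # True
print(bool(same_digit.fullmatch("7-8")))  # False
```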

Comments

Comments are a special type of group. The syntax is:

(?#comment)

Comments are useful to explain complex regular expressions; they affect neither matches nor captures.
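
Python's `re` module accepts the same comment syntax, so the claim can be sketched there (as a stand-in for the PATTERN engine, which is an assumption). The expression and the comment texts below are made up for illustration.

```python
import re

plain = re.compile(r"([A-Z]{2})_[0-9]{3}")
commented = re.compile(
    r"([A-Z]{2})(?#two-letter prefix)_[0-9]{3}(?#three-digit code)"
)

text = "AH_808"
# Comments add no capturing groups and change neither match nor capture.
print(plain.groups == commented.groups)     # True
print(commented.fullmatch(text).group(0))   # AH_808
print(commented.fullmatch(text).group(1))   # AH
```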