Groups peculiarities
The difference between match and capture
If you use the PATTERN attribute to set a field in an extraction rule, the groups used in the regular expression can affect the value of the field.
For example, consider this extraction rule:
SCOPE SENTENCE
{
IDENTIFY(OPENDATA)
{
@KNOWLEDGE_BASE[PATTERN("Wikipedia")]
}
}
This rule sets the KNOWLEDGE_BASE field to Wikipedia if the text contains a token whose text is Wikipedia. In this case the entire matched text is captured and transferred to the field.
This slightly different rule, however:
SCOPE SENTENCE
{
IDENTIFY(OPENDATA)
{
@KNOWLEDGE_BASE[PATTERN("(Wiki)pedia")]
}
}
sets the KNOWLEDGE_BASE field to Wiki if the text contains a token whose text is Wikipedia.
The match, therefore, is always determined by the entire regular expression.
The capture depends on the groups: if there are no capturing groups, the entire matched text is captured and transferred to the field; if there are capturing groups, only the text matched by those groups is captured.
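The distinction between match and capture can be reproduced with an ordinary regular expression engine. The following is a minimal Python sketch, not the engine's implementation: the helper name captured_value is hypothetical, it assumes that the expression must match the whole token, and it only emulates the behavior described above for expressions with at most one capturing group.

```python
import re

def captured_value(pattern, token):
    """Hypothetical helper emulating the capture for at most one group.

    Returns None if the token does not match, the text of group 1 if the
    pattern has a capturing group, otherwise the whole matched text.
    """
    m = re.fullmatch(pattern, token)  # assumption: the whole token must match
    if m is None:
        return None
    return m.group(1) if m.re.groups else m.group(0)

print(captured_value(r"Wikipedia", "Wikipedia"))    # Wikipedia
print(captured_value(r"(Wiki)pedia", "Wikipedia"))  # Wiki
print(captured_value(r"(Wiki)pedia", "Wikidata"))   # None
```

In both matching cases the token is the same; only the presence of the group changes what is handed over as the field value.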
Consider another PATTERN example:
PATTERN("\+\d+\s(\d+\s\d+\s)")
applied to this text:
+44 744 0963112
The attribute evaluates to true because the regular expression matches the entire text of three consecutive tokens. However, only the sub-expression between parentheses determines the capture, so if the attribute is used to set a field, the value of the field will be:
744 0963112
which omits the first part.
Non-capturing groups affect the match, but not the capture.
For example, if this rule:
SCOPE SENTENCE
{
IDENTIFY(OPENDATA)
{
@KNOWLEDGE_BASE[PATTERN("Wiki(?:pedia|data|base)")]
}
}
is applied to this text:
Wikipedia
Wikidata
Wikibase
the KNOWLEDGE_BASE field is set three times, respectively to:
Wikipedia
Wikidata
Wikibase
since all the matched text is captured, regardless of the group (which is non-capturing).
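As a rough illustration, Python's re module treats non-capturing groups in a comparable way. In the sketch below, the variable names and the comparison with a capturing variant of the expression are additions for illustration only and are not part of the rules language.

```python
import re

text = "Wikipedia\nWikidata\nWikibase"

# (?:...) groups the alternatives without creating a capture,
# so each result is the whole matched text.
non_capturing = re.findall(r"Wiki(?:pedia|data|base)", text)

# With a capturing group, findall returns only the group's text instead.
capturing = re.findall(r"Wiki(pedia|data|base)", text)

print(non_capturing)  # ['Wikipedia', 'Wikidata', 'Wikibase']
print(capturing)      # ['pedia', 'data', 'base']
```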
Nested groups rule
When a capturing group contains other capturing groups (which can, recursively, contain further capturing groups), the text corresponding to the outermost group is captured.
For example, if the text is:
He was awarded the Silver Star for military valor.
and the regular expression is:
((Gold|Silver|Purple) (Star|Cross|Heart))
the overall capture is:
Silver Star
Tip
When there are groups and sub-groups and the entire match needs to be captured, surround the regular expression with parentheses.
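The nested groups example can be checked with Python's re module, used here only as a stand-in for the engine. Python numbers groups by their opening parenthesis, so group 1 is the outermost one; the variable names are illustrative.

```python
import re

text = "He was awarded the Silver Star for military valor."
m = re.search(r"((Gold|Silver|Purple) (Star|Cross|Heart))", text)

outermost = m.group(1)           # group opened by the first parenthesis
inner = (m.group(2), m.group(3)) # the two nested groups

print(outermost)  # Silver Star
print(inner)      # ('Silver', 'Star')
```

The outermost group already spans everything its inner groups matched, which is why wrapping the whole expression in parentheses yields the entire match.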
Consecutive groups rule
If there are multiple capturing groups at the same level in a regular expression, the overall capture will be the concatenation of what is captured by all the groups, in the order in which the groups are found. A blank character is added as a separator in the concatenation.
For example, if the text is:
AH_808_BF_915
and the regular expression is:
([A-Z]{2})_[0-9]{3}_([A-Z]{2})_[0-9]{3}
the overall capture is:
AH BF
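A minimal Python sketch of the same concatenation follows. Python itself keeps the groups separate, so the join with a blank is an explicit emulation of the rule described above, and it is only equivalent when no group is nested inside another.

```python
import re

m = re.search(r"([A-Z]{2})_[0-9]{3}_([A-Z]{2})_[0-9]{3}", "AH_808_BF_915")

# Join the same-level groups with a blank, in the order they appear.
overall_capture = " ".join(m.groups())

print(m.groups())       # ('AH', 'BF')
print(overall_capture)  # AH BF
```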
Repetitions
Pay attention to the use of repetitions with capturing groups.
If the text is:
XXXL
the regular expression:
(X)*L
matches the entire text, but captures only:
X
because the capturing group itself corresponds to a single X character, even though the "zero or more occurrences" repetition is applied to the group with the asterisk character (*).
In order to capture all of the Xs, simply surround the expression with parentheses to create an outer capturing group:
((X)*)L
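The effect of repetitions can be observed with Python's re module, which behaves the same way for this case: a repeated group retains a single iteration, while an outer group spans all of them. The variable names are illustrative.

```python
import re

inner_only = re.fullmatch(r"(X)*L", "XXXL")
with_outer = re.fullmatch(r"((X)*)L", "XXXL")

print(inner_only.group(0))  # XXXL (whole match)
print(inner_only.group(1))  # X    (the group holds a single X)
print(with_outer.group(1))  # XXX  (outer group spans every repetition)
```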
Lookahead and lookbehind groups
Lookahead and lookbehind groups, whether positive or negative, must not be used in the regular expression syntax of the PATTERN attribute. To obtain the same effect, use capturing groups or sequence operators plus negations appropriately.
For example, to obtain the same effect as an expression like "Capture blue only if followed by sky" (positive lookahead), a lookahead group isn't necessary; just use a capturing group like this:
(blue) sky
To obtain the effect of a negative lookahead like: "Capture dark only if NOT followed by matter", use a condition like this:
@FIELDX[PATTERN("dark")]
>>
!PATTERN("matter")
Similarly, for a positive lookbehind like: "Capture star only if preceded by red", just use the capturing groups like this:
red (star)
while for a negative lookbehind like: "Capture star only if NOT preceded by cake", you can use a condition like this:
!PATTERN("cake")
>>
@FIELDX[PATTERN("star")]
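For the two positive cases, the equivalence can be checked in an engine that does support lookaround. The sketch below uses Python's re module purely for comparison (the lookaround forms are not valid in the PATTERN attribute) and the sample sentences are invented for illustration.

```python
import re

# Positive lookahead vs. capturing group: both isolate "blue" before " sky".
lookahead = re.search(r"blue(?= sky)", "a blue sky").group(0)
group_ahead = re.search(r"(blue) sky", "a blue sky").group(1)

# Positive lookbehind vs. capturing group: both isolate "star" after "red ".
lookbehind = re.search(r"(?<=red )star", "a red star").group(0)
group_behind = re.search(r"red (star)", "a red star").group(1)

print(lookahead, group_ahead)    # blue blue
print(lookbehind, group_behind)  # star star
```

The difference is in the match, not the capture: the capturing-group forms consume the surrounding word as part of the match, while the lookaround forms only test for it.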
Please note that regular expressions used in .js modules are based on Duktape ES5.1. While this engine supports the use of lookahead (?=), it does not support the use of lookbehind, as lookbehind assertions, such as (?<=...) for positive lookbehind and (?<!...) for negative lookbehind, were only added in later versions of JavaScript.
Still, we can turn a negative lookbehind into a negative lookahead, provided that we state a fixed width. For example, if we want the string “foo” not to be preceded by the letter “S” within three characters, we can turn this:
/(?<![S] )(foo)
into this:
/(?![S])(?:^.{0,2}|.{3})(foo)
Backward references
A backward reference is a reference to the text of a previously defined group.
Note
The reference is to the matched text and not to the expression itself.
A backward reference consists of the escape character \ followed by a number between 1 and 9. \1 refers to the first group, \2 to the second, and so on. For example:
(.*)-\1
matches any string made of the same text repeated twice with a hyphen in between, such as:
go-go
ha-ha
walla-walla
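The behavior can be tried with Python's re module, which uses the same \1 syntax. The sketch below also contrasts a backward reference with simply repeating the group's expression, to show that the reference is to the matched text; the strings go-ha and the (go|ha) expressions are illustrative additions.

```python
import re

repeated = re.compile(r"(.*)-\1")

# The three sample strings match; "go-ha" does not repeat itself.
results = {s: bool(repeated.fullmatch(s))
           for s in ["go-go", "ha-ha", "walla-walla", "go-ha"]}
print(results)

# \1 must equal the text matched by group 1, so "go-ha" is rejected...
by_reference = re.fullmatch(r"(go|ha)-\1", "go-ha")
# ...whereas repeating the expression accepts any combination.
by_expression = re.fullmatch(r"(go|ha)-(go|ha)", "go-ha")

print(by_reference is None)       # True
print(by_expression is not None)  # True
```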
Comments
Comments are a special type of group. The syntax is:
(?#comment)
Comments are useful to explain complex regular expressions; they affect neither matches nor captures.
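Python's re module accepts the same (?#comment) syntax, so the claim can be checked there. This is only a sketch with an invented comment text: it compares a commented expression with its plain equivalent.

```python
import re

plain = re.compile(r"Wiki(pedia|data)")
commented = re.compile(r"Wiki(?#knowledge base suffix)(pedia|data)")

# The comment adds no capturing group and does not change the result.
same_group_count = commented.groups == plain.groups
captured = commented.fullmatch("Wikidata").group(1)

print(same_group_count)  # True
print(captured)          # data
```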