regexcleaner
Overview
The regexcleaner module lets you perform find & replace operations on strings, mainly based on regular expressions.
The module supports Perl-compatible regular expressions, as do the REX object and the PATTERN attribute of the rules language.
The module has these methods:
addPLAIN
addREGEX
addREGEXBackref
addREGEXSelect
addREGEXSelectBackref
apply
load
getLastError
close
When you install the regexcleaner module in your project in Studio, Studio modifies the main.jr file to insert this statement at the beginning of the file:
var regexcleaner = require('modules/regexcleaner');
The statement above sets a variable with an instance of the module so that you can use it in all event handling functions.
Preparation and execution
Use the addPLAIN, addREGEX, addREGEXBackref, addREGEXSelect and addREGEXSelectBackref methods to prepare the find & replace operations you need, then invoke the apply method to actually perform them.
The apply method performs the operations in the same order in which they were prepared with the "add" methods.
Invoke the "add" methods in the initialize function. Then, in the most common use case, which is the pre-processing of the input document text, invoke the apply method in the onPrepare function. For example:
var regexcleaner = require('modules/regexcleaner');

function initialize(cmdline) {
    // Prepare the operations; nothing is replaced yet.
    regexcleaner.addREGEX('sec[ry]ion', 'section');
    regexcleaner.addREGEXBackref('(u){2,}', '$1');
    return true;
}

function onPrepare(text) {
    // Perform the prepared operations, in order, on the input text.
    var newText = regexcleaner.apply(text);
    return newText;
}
You are nevertheless free to use the apply method elsewhere in your code and with any other string.
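The prepare-then-apply flow described above can be pictured with a small plain JavaScript stand-in. This is only a sketch: createCleaner, add and the internal operations array are hypothetical names, it uses JavaScript's own regular expression engine rather than the Perl-compatible one, and it is not the regexcleaner module's actual implementation.

```javascript
// Hypothetical stand-in for the prepare/apply pattern; not the regexcleaner module.
function createCleaner() {
    const operations = []; // prepared operations, kept in insertion order
    return {
        // Queue a regex-based replacement without running it.
        add(pattern, replacement) {
            operations.push({ regex: new RegExp(pattern, 'g'), replacement });
            return true;
        },
        // Run every queued operation, in the order in which it was added.
        apply(text) {
            return operations.reduce(
                (current, op) => current.replace(op.regex, op.replacement),
                text
            );
        }
    };
}

const cleaner = createCleaner();
cleaner.add('sec[ry]ion', 'section');
cleaner.add('(u){2,}', '$1');
const cleaned = cleaner.apply('This secrion is suuuper.');
// cleaned is 'This section is super.'
```

Nothing is replaced while the operations are being queued; the text changes only when apply runs.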
addPLAIN
The addPLAIN method prepares a plain find & replace operation. It's the only operation not based on regular expressions.
Note
Remember, this and all the other "add" methods don't actually perform the find & replace operation. All the prepared operations are triggered by the apply method.
For example, this statement:
regexcleaner.addPLAIN('bamd', 'band');
prepares an operation that replaces all occurrences of the string bamd with the string band.
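As a rough illustration of what a plain, non-regex replacement of all occurrences means, here is a plain JavaScript sketch. The helper name replacePlain is hypothetical and this is not the module's own code; it only shows that the search string is taken literally.

```javascript
// Plain JavaScript sketch of a literal (non-regex) replace-all; not the module's code.
function replacePlain(text, searchString, replacementString) {
    // split/join treats searchString literally, so characters such as "." have no special meaning.
    return text.split(searchString).join(replacementString);
}

const fixed = replacePlain('The bamd played; the bamd left.', 'bamd', 'band');
// fixed is 'The band played; the band left.'

const literal = replacePlain('a.b and axb', 'a.b', 'X');
// literal is 'X and axb': the dot matched only a real dot
```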
The syntax is:
moduleVariable.addPLAIN(searchString, replacementString)
where:
- moduleVariable is the variable corresponding to the module and set with require().
- searchString is the string to replace.
- replacementString is the replacement string.
The method returns true in case of success, false otherwise. In case of failure, it sets an error message you can retrieve with the getLastError method.
addREGEX
The addREGEX method prepares a find & replace operation in which the strings to replace are found using a regular expression.
For example:
regexcleaner.addREGEX("sec[ry]ion", "section");
When applied to the following text:
That specific secrion of the movie was really awesome.
the operation fixes the typo.
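The effect on the sample sentence can be checked with an equivalent plain JavaScript replacement. This is a stand-in that uses JavaScript's regular expression engine, not the module itself, and the variable names are illustrative.

```javascript
// Plain JavaScript stand-in for the operation above; not the module's own code.
const typoRegex = /sec[ry]ion/g; // [ry] matches either "r" or "y"

const input = 'That specific secrion of the movie was really awesome.';
const output = input.replace(typoRegex, 'section');
// output is 'That specific section of the movie was really awesome.'

// Correctly spelled text does not match the expression and is left alone.
const untouched = 'A correct section.'.replace(typoRegex, 'section');
```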
The syntax is:
moduleVariable.addREGEX(regularExpression, replacementString)
where:
- moduleVariable is the variable corresponding to the module and set with require().
- regularExpression is the regular expression used to find the strings to replace.
- replacementString is the replacement string.
The method returns true in case of success, false otherwise. In case of failure, it sets an error message you can retrieve with the getLastError method.
addREGEXBackref
Like addREGEX, with the difference that the replacement string can contain special placeholders that are dynamically replaced with the contents of the capturing groups of the regular expression.
Consider the following operation:
regexcleaner.addREGEXBackref("([aeiou])\\1+", "$1");
When applied to this text:
Heeeeeey duuuuude! I looooove The Lord of the Rings.
the operation removes all the extra vowels, giving:
Hey dude! I love The Lord of the Rings.
Note
Remember to escape each backslash (\) with another backslash when you write a regular expression as a string. If you use an external XML file, this isn't necessary, because the backslash is escaped by the XML parser.
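The effect of doubling the backslash can be seen with standard JavaScript string rules alone. The sketch below uses only built-in language features; the variable names are illustrative.

```javascript
// In a JavaScript string literal, "\\" produces one backslash character.
const pattern = "([aeiou])\\1+";

// String.raw keeps the backslash exactly as typed, so no doubling is needed here.
const rawPattern = String.raw`([aeiou])\1+`;

const samePattern = pattern === rawPattern; // true: both hold ([aeiou])\1+
const patternLength = pattern.length;       // 12 characters, with a single backslash
```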
The regular expression has a capturing group, ([aeiou]), matching any vowel. This group is the first and only one, so it's number 1. The back reference inside the regular expression (\1), whose backslash must be escaped with another backslash (\) in the string, matches exactly what has already been captured by group 1, that is the same vowel, one or more times.
The result is a regular expression that matches repetitions of vowels.
The placeholder in the replacement string ($1) is dynamically replaced with the content of capturing group number 1, that is, with the first vowel of each repetition.
Note
Back references in the regular expression should not be confused with placeholders in the replacement string. Both refer to capturing groups, but in different ways and for different purposes. The regular expression does not have to use back references and may not even have capturing groups, but if it has them, you can refer to them in the replacement string using placeholders. Furthermore, the replacement string is not a regular expression.
The name of the method may suggest that it requires the use of back references in the regular expression, but that's not the case.
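Both points can be tried out in plain JavaScript, whose replace function happens to use the same \1 and $1 notation for this case. This is a stand-in using JavaScript's regular expression engine, not the module itself, and the second sample string is made up for illustration.

```javascript
// Back reference (\1) in the expression, placeholder ($1) in the replacement.
const text = 'Heeeeeey duuuuude! I looooove The Lord of the Rings.';
const result = text.replace(new RegExp("([aeiou])\\1+", "g"), "$1");
// result is 'Hey dude! I love The Lord of the Rings.'

// Placeholders without any back reference in the expression: swap two captured numbers.
const swapped = '2024-01'.replace(/(\d+)-(\d+)/, '$2/$1');
// swapped is '01/2024'
```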
The syntax of the method is:
moduleVariable.addREGEXBackref(regularExpression, replacementString)
where:
- moduleVariable is the variable corresponding to the module and set with require().
- regularExpression is the regular expression used to find the strings to replace.
- replacementString is the replacement string, possibly containing placeholders for capturing groups.
The method returns true in case of success, false otherwise. In case of failure, it sets an error message you can retrieve with the getLastError method.
Placeholders
Placeholders that reference capturing groups of the regular expression have the following syntax:
$capturingGroupNumber
where capturingGroupNumber is the number of the capturing group in the regular expression.
In case of nested capturing groups, the inner groups have higher numbers. So for example in:
(super(man|woman|mouse))
the outermost group can be referenced with placeholder $1 and the innermost one with placeholder $2.
Placeholder $0 is special since it corresponds to the capture of the entire regular expression, which is not required to be contained in parentheses.
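Group numbering can be checked in plain JavaScript, which numbers nested groups the same way. One difference to keep in mind: JavaScript's replace writes the whole match as $& rather than $0, so the sketch below is only an analogy for the $0 placeholder. The sample sentence is made up for illustration.

```javascript
const sentence = 'a supermouse appears';

// match[0] is the whole match, match[1] the outer group, match[2] the inner group.
const match = sentence.match(/(super(man|woman|mouse))/);
const groups = [match[0], match[1], match[2]];
// groups is ['supermouse', 'supermouse', 'mouse']

// Outer and inner groups used as placeholders in a replacement string.
const rewritten = sentence.replace(/(super(man|woman|mouse))/, '$2 (from $1)');
// rewritten is 'a mouse (from supermouse) appears'

// Whole match without any capturing parentheses ($& in JavaScript, analogous to $0).
const wrapped = sentence.replace(/super(?:man|woman|mouse)/, '<$&>');
// wrapped is 'a <supermouse> appears'
```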
addREGEXSelect
Like addREGEX, but with an additional regular expression used to restrict the find & replace operation to selected areas of the text.
The parts of the text corresponding to this regular expression are selected, then the text to be replaced is searched within the selected parts.
Consider the following example:
regexcleaner.addREGEXSelect("(?is)song title:.*lyrics:", "(?is)\\R", " ");
and this input text:
Song title:
The
Long
And
Winding
Road
Lyrics:
The long and winding road
That leads to your door
Will never disappear
...
When the operation is applied you get:
Song title:
The Long And Winding Road
Lyrics:
The long and winding road
That leads to your door
Will never disappear
...
because line breaks are replaced with a blank, but only in the song title area.
The syntax is:
moduleVariable.addREGEXSelect(regularExpression1, regularExpression2, replacementString)
where:
moduleVariable
is the variable corresponding to the module and set withrequire()
.regularExpression1
is a regular expression that spots the portions of text where the find & replace operation occurs.regularExpression2
is the regular expression used to find the text to replace in portions of text selected by regularexpression1.replacementString
is the replacement string.
The method returns true
in case of success, false
otherwise. In case of failure it sets an error message you can retrieve with the getLastError
method.
addREGEXSelectBackref
Like addREGEXSelect, with the difference that the replacement string can contain placeholders corresponding to capturing groups, as in addREGEXBackref.
For example, when the operation prepared with:
regexcleaner.addREGEXSelectBackref('^(.+!)', '([aeiou])\\1+', '$1');
is applied to this text:
Heeeeeey duuuuude! I looooove The Lord of the Rings.
you get:
Hey dude! I looooove The Lord of the Rings.
As you can see, the operation is limited to the first sentence, up to the exclamation mark.
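The select-then-replace behavior can be approximated in plain JavaScript with a nested replace: the outer expression picks the areas, the inner one works only inside them. The helper name replaceInSelection is hypothetical and this is a sketch of the idea, not the module's implementation.

```javascript
// Hypothetical helper: replace matches of findRegex only inside areas matched by selectRegex.
function replaceInSelection(text, selectRegex, findRegex, replacement) {
    return text.replace(selectRegex, (area) => area.replace(findRegex, replacement));
}

const original = 'Heeeeeey duuuuude! I looooove The Lord of the Rings.';
const limited = replaceInSelection(original, /^(.+!)/g, /([aeiou])\1+/g, '$1');
// limited is 'Hey dude! I looooove The Lord of the Rings.'
```

Text outside the selected area, here everything after the exclamation mark, is never seen by the inner expression.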
The syntax is:
moduleVariable.addREGEXSelectBackref(regularExpression1, regularExpression2, replacementString)
where:
moduleVariable
is the variable corresponding to the module and set withrequire()
.regularExpression1
is a regular expression that spots the portions of text where the find & replace operation occurs.regularExpression2
is the regular expression used to find the text to replace in portions of text selected by regularexpression1.replacementString
is the replacement string possibly containing capturing groups' placeholders.
The method returns true
in case of success, false
otherwise. In case of failure it sets an error message you can retrieve with the getLastError
method.
load
The load method prepares find & replace operations, similarly to what you can do by invoking "add" methods, using as its source a configuration file generated when importing a project created with a legacy edition of Studio.
As mentioned above, Studio uses Perl compatible regular expressions and no flag is set by default.
For example, the character . does not match line breaks. To match them, use the flag (?s).
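The default behavior of the dot can be observed in plain JavaScript as well. Note the difference in notation: Perl-compatible expressions enable the single-line option inline with (?s), while the JavaScript stand-in below passes it as the s flag after the closing slash. The sample text is made up for illustration.

```javascript
const twoLines = 'first\nsecond';

// By default the dot does not match the line break between the two words.
const withoutFlag = /first.second/.test(twoLines); // false

// With the single-line option, the dot matches line breaks too.
const withFlag = /first.second/s.test(twoLines);   // true
```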
As with the "add" methods, you must then invoke the apply method to perform the prepared operations.
Warning
The load method is needed only in the case described above, and the import procedure already generates the appropriate statements inside the main.jr file, so there are basically no cases in which you have to write code that uses this method.
For example, when importing an old project Studio may generate this code:
var regexcleaner = require("modules/regexMod");

function initialize(cmdline) {
    // Load the operations from the imported configuration file.
    if (!regexcleaner.load('Config.xml')) {
        CONSOLE.error(regexcleaner.getLastError());
        return false;
    }
    return true;
}

function onPrepare(text) {
    text = regexcleaner.apply(text);
    return text;
}

function shutdown() {
    regexcleaner.close();
}
The syntax is:
moduleVariable.load(configFilePath)
where:
moduleVariable
is the variable corresponding to the module and set withrequire()
.configFilePath
is the path of the configuration file path generated when importing the old technology project.
The method returns true
in case of success, false
otherwise. In case of failure it sets an error message you can retrieve with the getLastError
method.
apply
The apply method performs all the operations prepared with invocations of "add" methods or with the invocation of the load method.
For example:
function onPrepare(text) {
    return regexcleaner.apply(text);
}
The syntax is:
moduleVariable.apply(string)
where:
moduleVariable
is the variable corresponding to the module and set withrequire()
.string
is the string on which to perform the find & replace operations.
The method returns the modified string.
getLastError
The getLastError method retrieves the message corresponding to the last error that occurred when an "add" method or the load method failed. Use it to display the error message.
For example:
function initialize(cmdline) {
    if (!regexcleaner.load('Config.xml')) {
        // Report why the load failed.
        CONSOLE.error(regexcleaner.getLastError());
        return false;
    }
    return true;
}
The syntax is:
moduleVariable.getLastError()
where moduleVariable is the variable corresponding to the module and set with require().
close
The close method frees the resources allocated by the regexcleaner module object.
It's not mandatory to invoke this method, but if you decide to do so, invoke it inside the shutdown function.
For example:
function shutdown() {
    regexcleaner.close();
}
The syntax is:
moduleVariable.close()
where moduleVariable is the variable corresponding to the module and set with require().
Debug mode
You can activate the debug mode for this module to check whether your regular expressions work before using a method. To do so, edit these variables in the regexcleaner.jr file:
var debug_mode = false;
var only_debug_regex = true;
var max_char_length = 500;
- Set the first variable to true.
- Optionally change the maximum number of characters to be analyzed (500 by default).
If you consider the example described in addREGEXSelectBackref, you will find the following notifications in the Console tool window: