regexcleaner
Overview
The regexcleaner module performs find & replace operations on strings, mainly based on regular expressions.
The module supports Perl compatible regular expressions, as do the REX object and the PATTERN attribute of the rules language.
The module has these methods:
addPLAIN
addREGEX
addREGEXBackref
addREGEXSelect
addREGEXSelectBackref
repairBrokenWords
apply
load
getLastError
close
When you install the regexcleaner module in your project, Studio modifies the main.jr file by inserting this statement at the beginning of the file:
var regexcleaner = require('modules/regexcleaner');
The statement above sets a variable with an instance of the module so that you can use it in all event handling functions.
Preparation and execution
Use the addPLAIN, addREGEX, addREGEXBackref, addREGEXSelect and addREGEXSelectBackref methods to prepare the find & replace operations you need, then invoke the apply method to actually perform them.
The apply method performs the operations in the same order in which they were prepared with the "add" methods.
Invoke the "add" methods in the initialize function to pre-process the text before it is submitted to analysis. Then invoke the apply method in the onPrepare function to apply changes during the document preparation. For example:
var regexcleaner = require('modules/regexcleaner');

function initialize(cmdline) {
    regexcleaner.addREGEX('sec[ry]ion', 'section');
    regexcleaner.addREGEXBackref('(u){2,}', '$1');
    return true;
}

function onPrepare(text) {
    var newText = regexcleaner.apply(text);
    return newText;
}
You are nevertheless free to use the apply method in another point of your code and with any other string.
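The prepare-then-apply model described above can be pictured as a queue of operations that is replayed in insertion order. The sketch below is plain JavaScript and does not use the regexcleaner module: createCleaner and its add method are hypothetical stand-ins, and JavaScript's RegExp is not fully Perl compatible. It only illustrates why the order of the "add" calls matters.

```javascript
// Hypothetical stand-in for the prepare/apply model; not the regexcleaner API.
function createCleaner() {
  const operations = []; // prepared operations, in insertion order
  return {
    // Prepare a regex-based replacement; nothing is executed yet.
    add(pattern, replacement) {
      operations.push({ regex: new RegExp(pattern, 'g'), replacement });
      return true;
    },
    // Run every prepared operation, each on the output of the previous one.
    apply(text) {
      return operations.reduce(
        (current, op) => current.replace(op.regex, op.replacement),
        text
      );
    },
  };
}

const cleaner = createCleaner();
cleaner.add('sec[ry]ion', 'section'); // runs first
cleaner.add('section', 'chapter');    // sees the output of the first operation

const cleaned = cleaner.apply('That secrion was great.');
console.log(cleaned); // That chapter was great.
```

If the two add calls were swapped, the second operation would find nothing to replace when it runs, and the result would be "That section was great."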
Use the load method in the initialize function, because that is the right place to initialize objects needed in other event handling functions.
The getLastError method is typically used in the initialize function as well, because it retrieves the message of the last error that occurred when the load method or an "add" method failed.
If you invoke the close method, do it in the shutdown function, because it frees the resources allocated by the regexcleaner module object.
addPLAIN
The addPLAIN method prepares a plain find & replace operation. It's the only operation not based upon regular expressions.
Note
Remember, this and all the other "add" methods don't actually perform the find & replace operation. All the prepared operations are triggered by the apply method.
For example, this statement:
regexcleaner.addPLAIN('bamd', 'band');
prepares an operation that replaces all occurrences of the string bamd with the string band.
The syntax is:
moduleVariable.addPLAIN(searchString, replacementString)
where:
moduleVariable is the variable corresponding to the module and set with require().
searchString is the string to replace.
replacementString is the replacement string.
The method returns true in case of success, false otherwise. In case of failure, it sets an error message you can retrieve with the getLastError method.
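Since a plain operation is not based on regular expressions, characters such as . in the search string are presumably taken literally. The sketch below contrasts the two behaviours in plain JavaScript. replacePlain is a hypothetical helper, not part of the module, and the literal treatment of the search string is an assumption drawn from the description above.

```javascript
// Hypothetical helper: replace every literal occurrence of `search`.
function replacePlain(text, search, replacement) {
  // split/join treats both strings literally (no regex, no $ placeholders).
  return text.split(search).join(replacement);
}

const input = 'v1.0 and v1x0';
const plainResult = replacePlain(input, '1.0', '2.0');
// As a regular expression, the dot would match any character.
const regexResult = input.replace(/1.0/g, '2.0');

console.log(plainResult); // v2.0 and v1x0
console.log(regexResult); // v2.0 and v2.0
```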
addREGEX
The addREGEX method prepares a find & replace operation in which the strings to replace are found using a regular expression.
For example:
regexcleaner.addREGEX("sec[ry]ion", "section");
When applied to the following text:
That specific secrion of the movie was really awesome.
the operation fixes the typo.
The syntax is:
moduleVariable.addREGEX(regularExpression, replacementString)
where:
moduleVariable is the variable corresponding to the module and set with require().
regularExpression is the regular expression used to find the strings to replace.
replacementString is the replacement string.
The method returns true in case of success, false otherwise. In case of failure, it sets an error message you can retrieve with the getLastError method.
addREGEXBackref
Like addREGEX, with the difference that the replacement string can contain special placeholders that are dynamically replaced with the contents of the capturing groups of the regular expression.
Consider the following operation:
regexcleaner.addREGEXBackref("([aeiou])\\1+", "$1");
When applied to this text:
Heeeeeey duuuuude! I looooove The Lord of the Rings.
the operation removes the repeated vowels, giving:
Hey dude! I love The Lord of the Rings.
Note
Remember to escape backslashes (\) with another one when using them in regular expressions. If you use an external XML, this won't be necessary, because the backslash is escaped by the XML parser.
The regular expression has a capturing group, ([aeiou]), matching any vowel. It is the first and only group, so its number is 1. The back reference inside the regular expression (\1), whose backslash must be escaped with another backslash (\\) in the string, matches exactly what group 1 has already captured, that is the same vowel, one or more times.
The result is a regular expression that captures repetitions of vowels.
The placeholder in the replacement string ($1) is dynamically replaced with the content of capturing group number 1, that is, with the first vowel of each repetition.
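The walkthrough above can be checked outside Studio with JavaScript's own RegExp, used here only as a stand-in (it is not the regexcleaner engine and is not fully Perl compatible, although back references and $1 placeholders behave the same way for this pattern). The sketch also prints the pattern to show what the doubled backslash becomes once the string literal is parsed.

```javascript
// The same pattern and placeholder, run with JavaScript's RegExp as a stand-in.
const pattern = '([aeiou])\\1+'; // the string value is ([aeiou])\1+
const repeatedVowels = new RegExp(pattern, 'g');

const input = 'Heeeeeey duuuuude! I looooove The Lord of the Rings.';
const output = input.replace(repeatedVowels, '$1');

console.log(pattern); // ([aeiou])\1+
console.log(output);  // Hey dude! I love The Lord of the Rings.
```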
Note
Back references in the regular expression should not be confused with placeholders in the replacement string. Both refer to capturing groups, but in different ways and for different purposes. The regular expression does not have to use back references and may not even have capturing groups, but if it has them, you can refer to them in the replacement string using placeholders. Furthermore, the replacement string is not a regular expression.
The name of the method may suggest that it requires back references in the regular expression, but that is not the case.
The syntax of the method is:
moduleVariable.addREGEXBackref(regularExpression, replacementString)
where:
moduleVariable is the variable corresponding to the module and set with require().
regularExpression is the regular expression used to find the strings to replace.
replacementString is the replacement string, possibly containing capturing groups' placeholders.
The method returns true in case of success, false otherwise. In case of failure it sets an error message you can retrieve with the getLastError method.
Placeholders
Placeholders that reference capturing groups of the regular expression have the following syntax:
$capturingGroupNumber
where capturingGroupNumber is the number of the capturing group in the regular expression.
In case of nested capturing groups, the inner groups have higher numbers. So for example in:
(super(man|woman|mouse))
the outermost group can be referenced with placeholder $1 and the innermost one with placeholder $2.
Placeholder $0 is special since it corresponds to the capture of the entire regular expression, which is not required to be contained in parentheses.
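The numbering of nested groups can be inspected with a short JavaScript sketch. JavaScript's RegExp is only a stand-in for the module's engine, and one difference is worth noting: in JavaScript replacement strings the whole match is written $&, not $0, so the sketch reads the whole match from index 0 of the match array instead.

```javascript
const nested = /(super(man|woman|mouse))/;
const sentence = 'Here comes supermouse!';
const match = nested.exec(sentence);

console.log(match[0]); // supermouse (whole match, the $0 of the text above)
console.log(match[1]); // supermouse (outer group, $1)
console.log(match[2]); // mouse (inner group, $2)

// Placeholders in a replacement string pick up the same groups.
const tagged = sentence.replace(nested, '<$1:$2>');
console.log(tagged); // Here comes <supermouse:mouse>!
```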
addREGEXSelect
Like addREGEX, but with an additional regular expression used to circumscribe the find & replace operation to selected areas of the text.
The parts of the text corresponding to this regular expression are selected, then the text to be replaced is searched within the selected parts.
Consider the following example:
regexcleaner.addREGEXSelect("(?is)song title:.*lyrics:", "(?is)\\R", " ");
and this input text:
Song title:
The
Long
And
Winding
Road
Lyrics:
The long and winding road
That leads to your door
Will never disappear
...
When the operation is applied you get:
Song title:
The Long And Winding Road
Lyrics:
The long and winding road
That leads to your door
Will never disappear
...
because line breaks are replaced with a space, but only in the song title area.
The syntax is:
moduleVariable.addREGEXSelect(regularExpression1, regularExpression2, replacementString)
where:
moduleVariable is the variable corresponding to the module and set with require().
regularExpression1 is a regular expression that spots the portions of text where the find & replace operation occurs.
regularExpression2 is the regular expression used to find the text to replace in the portions of text selected by regularExpression1.
replacementString is the replacement string.
The method returns true in case of success, false otherwise. In case of failure it sets an error message you can retrieve with the getLastError method.
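The select-then-replace idea can be sketched in plain JavaScript as a replace nested inside another replace. This is only an illustration of the concept, not the module's implementation: replaceInSelection is a hypothetical helper, and because JavaScript's RegExp supports neither \R nor leading inline options such as (?is), the sketch uses a different, simpler example than the song title one.

```javascript
// Hypothetical two-stage replace: select areas, then replace inside them.
function replaceInSelection(text, selectRegex, findRegex, replacement) {
  return text.replace(selectRegex, (area) =>
    // A function replacement keeps `replacement` literal.
    area.replace(findRegex, () => replacement)
  );
}

const input = 'a, b (c, d) e, f';
// Select parenthesised spans, then turn commas into semicolons inside them.
const output = replaceInSelection(input, /\([^)]*\)/g, /,/g, ';');

console.log(output); // a, b (c; d) e, f
```

Commas outside the parentheses are left alone because they never reach the inner replace.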
addREGEXSelectBackref
Like addREGEXSelect, with the difference that the replacement string can contain placeholders corresponding to capturing groups, as in addREGEXBackref.
For example, when the operation prepared with:
regexcleaner.addREGEXSelectBackref('^(.+!)', '([aeiou])\\1+', '$1');
is applied to this text:
Heeeeeey duuuuude! I looooove The Lord of the Rings.
you get:
Hey dude! I looooove The Lord of the Rings.
As you can see, the operation is limited to the first sentence, up to the exclamation mark.
The syntax is:
moduleVariable.addREGEXSelectBackref(regularExpression1, regularExpression2, replacementString)
where:
moduleVariable is the variable corresponding to the module and set with require().
regularExpression1 is a regular expression that spots the portions of text where the find & replace operation occurs.
regularExpression2 is the regular expression used to find the text to replace in the portions of text selected by regularExpression1.
replacementString is the replacement string, possibly containing capturing groups' placeholders.
The method returns true in case of success, false otherwise. In case of failure it sets an error message you can retrieve with the getLastError method.
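In the example of this section both regular expressions contain a capturing group, and the documented result suggests that $1 refers to the group of the second expression (the vowel), not to the group of the selection expression. The JavaScript sketch below reproduces that result under this assumption. replaceInSelectionBackref is a hypothetical helper built on JavaScript's RegExp, not the module's code.

```javascript
// Hypothetical helper: placeholders are resolved by the inner replace,
// so they refer to the capturing groups of findRegex.
function replaceInSelectionBackref(text, selectRegex, findRegex, replacement) {
  return text.replace(selectRegex, (area) => area.replace(findRegex, replacement));
}

const input = 'Heeeeeey duuuuude! I looooove The Lord of the Rings.';
const output = replaceInSelectionBackref(input, /^(.+!)/, /([aeiou])\1+/g, '$1');

console.log(output); // Hey dude! I looooove The Lord of the Rings.
```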
repairBrokenWords
The repairBrokenWords method is designed to reconstruct broken words by reading a list of words from an external source and intelligently piecing them back together when they are fragmented by spaces or similar characters (line breaks are not considered in this evaluation).
For example, if you pass a list with the word LIABILITY to this method, it can mend broken variations like L I A B I L I T Y or L IAB ILIT Y and restore them to their original form.
The syntax of the method is:
moduleVariable.repairBrokenWords(listPath[, caseInsensitiveFlag, charactersAcceptedBetweenLetters])
where:
moduleVariable is the variable corresponding to the module and set with require().
listPath specifies the path to the external list containing the words to repair, relative to the rules folder of your project. The list format is described below.
caseInsensitiveFlag (optional, defaults to false), when set to true, enables case-insensitive matching of the broken words.
charactersAcceptedBetweenLetters (optional) is an array containing the characters accepted between individual letters of the broken words. For example, [" ", "."] matches blank spaces and full stops. If omitted, only spaces are accepted between the letters.
The external list supports empty lines and comment lines that start with //.
For example, when the operation is prepared with:
regexcleaner.repairBrokenWords("regexcleaner/broken_words.cl", true, [" ", "."]);
and the list (saved at the path rules/regexcleaner/broken_words.cl) contains:
// Insurance-related words
Insured
Reinsured
Assured
if the input text includes the line:
NAME OF THE A SS UR E D
it is corrected to:
NAME OF THE ASSURED
It's important to note that the method only considers complete words, so a string like:
NAME OF THEAS SURE D
won't be adjusted, as "ASSURED" is not a complete word separated by boundaries.
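One way to picture the behaviour described in this section is to turn each word of the list into a regular expression that tolerates the accepted characters between letters. The JavaScript sketch below does that. It is an assumption about how such a repair could work, not the module's actual algorithm, and parseWordList, repairWords and escapeRegExp are hypothetical names. The letter-based boundary check is also a simplification.

```javascript
// Hypothetical sketch; the module's real algorithm is not documented here.

// Escape characters that are special in a JavaScript regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Keep one word per line, skipping empty lines and // comment lines.
function parseWordList(content) {
  return content
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line !== '' && !line.startsWith('//'));
}

function repairWords(text, words, caseInsensitive = false, separators = [' ']) {
  // Zero or more accepted separator characters between two letters.
  const gap = '(?:' + separators.map(escapeRegExp).join('|') + ')*';
  const flags = caseInsensitive ? 'gi' : 'g';
  return words.reduce((current, word) => {
    const letters = Array.from(word).map(escapeRegExp).join(gap);
    // Lookarounds require the candidate not to touch other letters.
    const regex = new RegExp('(?<![A-Za-z])' + letters + '(?![A-Za-z])', flags);
    // Drop the separators but keep the letters as they appear in the text.
    return current.replace(regex, (found) =>
      separators.reduce((s, sep) => s.split(sep).join(''), found)
    );
  }, text);
}

const list = parseWordList('// Insurance-related words\nInsured\n\nReinsured\nAssured');
const fixed = repairWords('NAME OF THE A SS UR E D', list, true, [' ', '.']);
const untouched = repairWords('NAME OF THEAS SURE D', list, true, [' ', '.']);

console.log(fixed);     // NAME OF THE ASSURED
console.log(untouched); // NAME OF THEAS SURE D
```

The second call leaves the text unchanged because the first letter of the candidate is glued to THE, so the boundary check fails.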
load
The load method prepares find & replace operations, similarly to what you can do by invoking "add" methods, using as its source a configuration file generated when importing a project created with a legacy edition of Studio.
As mentioned above, Studio uses Perl compatible regular expressions and no flag is set by default.
For example, the character . does not match line breaks. To make it match them, use the flag (?s).
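The effect of the missing dot-all option can be seen with a two-line string. The sketch uses JavaScript's RegExp as a stand-in: JavaScript expresses the option as the s flag after the closing slash, whereas Perl compatible engines also accept it inline as (?s) at the start of the pattern.

```javascript
const twoLines = 'first\nsecond';

// Without a dot-all option, the dot stops at the line break.
const defaultDot = /first.second/.test(twoLines);
// JavaScript's s flag plays the role of PCRE's inline (?s) option.
const dotAll = /first.second/s.test(twoLines);

console.log(defaultDot); // false
console.log(dotAll);     // true
```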
As for "add" methods, you must then invoke the apply method to perform the prepared operations.
Warning
The load method is needed only in the case described above, and the import procedure already generates the appropriate statements inside the main.jr file, so in practice you should not have to write code that uses this method.
For example, when importing an old project Studio may generate this code:
var regexcleaner = require("modules/regexcleaner");

function initialize(cmdline) {
    if (!regexcleaner.load('Config.xml')) {
        CONSOLE.error(regexcleaner.getLastError());
        return false;
    }
}

function onPrepare(text) {
    text = regexcleaner.apply(text);
    return text;
}

function shutdown() {
    regexcleaner.close();
}
The syntax is:
moduleVariable.load(configFilePath)
where:
moduleVariable is the variable corresponding to the module and set with require().
configFilePath is the path of the configuration file generated when importing the legacy project.
The method returns true in case of success, false otherwise. In case of failure it sets an error message you can retrieve with the getLastError method.
apply
The apply method performs all the operations prepared with invocations of "add" methods or with the invocation of the load method.
For example:
function onPrepare(text) {
    return regexcleaner.apply(text);
}
The syntax is:
moduleVariable.apply(string)
where:
moduleVariable is the variable corresponding to the module and set with require().
string is the string on which to perform the find & replace operations.
The method returns the modified string.
getLastError
The getLastError method retrieves the message corresponding to the last error that occurred when an "add" method or the load method failed. Use it to display the error message.
For example:
function initialize(cmdline) {
    if (!regexcleaner.load('Config.xml')) {
        CONSOLE.error(regexcleaner.getLastError());
        return false;
    }
}
The syntax is:
moduleVariable.getLastError()
where moduleVariable is the variable corresponding to the module and set with require().
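Since the "add" methods also return false on failure, the same check can be applied to a whole series of preparation steps. The sketch below shows one possible pattern. prepareAll is a hypothetical helper, and createStub is an invented stand-in for the module (including its validation and error text) so that the code runs outside Studio.

```javascript
// Run a list of preparation steps and stop at the first failure.
// `cleaner` is expected to expose the "add" methods and getLastError.
function prepareAll(cleaner, steps) {
  for (const [method, ...args] of steps) {
    if (!cleaner[method](...args)) {
      return { ok: false, error: cleaner.getLastError() };
    }
  }
  return { ok: true, error: null };
}

// Stub standing in for the module so the sketch runs outside Studio.
// Its validation and error text are invented for the demonstration.
function createStub() {
  let lastError = '';
  const addRegex = (pattern) => {
    try {
      new RegExp(pattern);
      return true;
    } catch (e) {
      lastError = 'invalid regular expression: ' + pattern;
      return false;
    }
  };
  return {
    addPLAIN: () => true,
    addREGEX: addRegex,
    getLastError: () => lastError,
  };
}

const good = prepareAll(createStub(), [
  ['addPLAIN', 'bamd', 'band'],
  ['addREGEX', 'sec[ry]ion', 'section'],
]);
const bad = prepareAll(createStub(), [['addREGEX', 'sec[ryion', 'section']]);

console.log(good.ok); // true
console.log(bad.ok, bad.error); // false invalid regular expression: sec[ryion
```

In a Studio project the regexcleaner variable would take the place of the stub, and the initialize function could log the returned error and return false, as in the example above.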
close
The close method is used to free up the resources allocated by the regexcleaner module object.
It's not mandatory to invoke this method, but if you decide to do it, invoke it inside the shutdown function.
For example:
function shutdown() {
    regexcleaner.close();
}
The syntax is:
moduleVariable.close()
where moduleVariable is the variable corresponding to the module and set with require().