regexcleaner
Overview
The regexcleaner module performs find & replace operations on strings, mainly based on regular expressions.
The module supports Perl compatible regular expressions, like the REX object and the PATTERN attribute of the rules language.
The module has these methods:
addPLAIN
addREGEX
addREGEXBackref
addREGEXSelect
addREGEXSelectBackref
repairBrokenWords
apply
load
getLastError
close
When you install the regexcleaner module in your Studio project, Studio modifies the main.jr file to insert this statement at the beginning of the file:
var regexcleaner = require('modules/regexcleaner');
The statement above sets a variable with an instance of the module so that you can use it in all event handling functions.
Preparation and execution
Use the addPLAIN, addREGEX, addREGEXBackref, addREGEXSelect and addREGEXSelectBackref methods to prepare the find & replace operations you need, then invoke the apply method to actually perform them.
The apply method performs the operations in the same order in which they were prepared with the "add" methods.
Invoke the "add" methods in the initialize function to prepare the operations, then invoke the apply method in the onPrepare function to pre-process the text during document preparation, before it is submitted to analysis. For example:
var regexcleaner = require('modules/regexcleaner');

function initialize(cmdline) {
    // Prepare the operations; nothing is replaced yet.
    regexcleaner.addREGEX('sec[ry]ion', 'section');
    regexcleaner.addREGEXBackref('(u){2,}','$1');
    return true;
}

function onPrepare(text) {
    // Perform the prepared operations, in the order they were added.
    var newText = regexcleaner.apply(text);
    return newText;
}
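The ordering rule can be tried outside Studio with a small stand-in written in plain JavaScript. This is only a sketch: it does not use the regexcleaner module, the names makeCleaner and add are hypothetical, and JavaScript regular expressions stand in for the Perl compatible ones used by the module.

```javascript
// Hypothetical stand-in, NOT the regexcleaner module: it only mimics the
// "prepare first, apply later, in preparation order" behavior.
function makeCleaner() {
  const operations = []; // prepared operations, in insertion order
  return {
    // Prepare an operation without running it.
    add(pattern, replacement) {
      operations.push({ regex: new RegExp(pattern, 'g'), replacement });
      return true;
    },
    // Run every prepared operation, first prepared first.
    apply(text) {
      return operations.reduce(
        (current, op) => current.replace(op.regex, op.replacement),
        text
      );
    }
  };
}

const first = makeCleaner();
first.add('sec[ry]ion', 'section'); // fixes the typo
first.add('section', 'chapter');    // then sees the fixed word
const ordered = first.apply('That secrion was great.');
// ordered === 'That chapter was great.'

const second = makeCleaner();
second.add('section', 'chapter');    // finds nothing to replace yet
second.add('sec[ry]ion', 'section'); // fixes the typo afterwards
const reversed = second.apply('That secrion was great.');
// reversed === 'That section was great.'
```

The two results differ only because the same two operations were prepared in a different order.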
You are nevertheless free to invoke the apply method at any other point of your code and on any other string.
The load method must be used in the initialize function, because that is the right place to initialize objects needed in other event handling functions.
The getLastError method must be used in the initialize function, because it retrieves the message corresponding to the last error that occurred when an "add" method or the load method fails.
The close method must be used in the shutdown function, because it frees up the resources allocated by the regexcleaner module object.
addPLAIN
The addPLAIN
method prepares a plain find & replace operation. It's the only operation not based upon regular expressions.
Note
Remember, this and all the other "add" methods don't actually perform the find & replace operation. All the prepared operations are triggered by the apply
method.
For example, this statement:
regexcleaner.addPLAIN('bamd', 'band');
prepares an operation that replaces all occurrences of the string bamd with the string band.
The syntax is:
moduleVariable.addPLAIN(searchString, replacementString)
where:
moduleVariable is the variable corresponding to the module and set with require().
searchString is the string to replace.
replacementString is the replacement string.
The method returns true
in case of success, false
otherwise. In case of failure, it sets an error message you can retrieve with the getLastError
method.
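The difference between a plain operation and a regular expression operation can be illustrated in plain JavaScript. The sketch below does not call the module: replacePlain is a hypothetical helper that treats the search string literally, which is what "not based upon regular expressions" is assumed to mean here.

```javascript
// Plain-JavaScript illustration, NOT the regexcleaner module.
// replacePlain is a hypothetical helper: every character of searchString
// is matched literally, including characters such as "." or "*".
function replacePlain(text, searchString, replacementString) {
  return text.split(searchString).join(replacementString);
}

const input = 'v1.0 and v1x0';

// Literal search: only the exact string "1.0" is replaced.
const plainResult = replacePlain(input, '1.0', '2.0');
// plainResult === 'v2.0 and v1x0'

// Regular expression search: "." matches any character, so "1x0" matches too.
const regexResult = input.replace(new RegExp('1.0', 'g'), '2.0');
// regexResult === 'v2.0 and v2.0'
```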
addREGEX
The addREGEX method prepares a find & replace operation in which the strings to replace are found using a regular expression.
For example:
regexcleaner.addREGEX("sec[ry]ion", "section");
When applied to the following text:
That specific secrion of the movie was really awesome.
the operation fixes the typo.
The syntax is:
moduleVariable.addREGEX(regularExpression, replacementString)
where:
moduleVariable is the variable corresponding to the module and set with require().
regularExpression is the regular expression used to find the strings to replace.
replacementString is the replacement string.
The method returns true
in case of success, false
otherwise. In case of failure, it sets an error message you can retrieve with the getLastError
method.
addREGEXBackref
Like addREGEX, with the difference that the replacement string can contain special placeholders that are dynamically replaced with the contents of the capturing groups of the regular expression.
Consider the following operation:
regexcleaner.addREGEXBackref("([aeiou])\\1+", "$1")
When applied to this text:
Heeeeeey duuuuude! I looooove The Lord of the Rings.
the operation cuts off all of the exceeding vowels giving:
Hey dude! I love The Lord of the Rings.
Note
Remember to escape each backslash (\) with another backslash when you write a regular expression as a string in the code. If you use an external XML file, this isn't necessary, because the backslash is escaped by the XML parser.
The regular expression has a capturing group, ([aeiou]), matching any vowel. This group is the first, and only, capturing group, so its number is 1. The back reference inside the regular expression (\1), which must be written with a double backslash (\\1) in the string, matches exactly what group 1 has already captured, that is the same vowel, one or more times.
The result is a regular expression that matches repetitions of the same vowel.
The placeholder in the replacement string ($1) is dynamically replaced with the content of capturing group number 1, that is, with the first vowel of each repetition.
Note
Back references in the regular expression should not be confused with placeholders in the replacement string. Both refer to capturing groups, but in a different way and for different purposes. The regular expression does not necessarily have to use back references and may not even have capturing groups, but if it has capturing groups, the replacement string can refer to them using placeholders. Furthermore, the replacement string is not a regular expression.
The name of the method may suggest that it requires the use of back references in the regular expression, but that is not the case.
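The example can be reproduced with JavaScript's own regular expression engine, which handles this particular pattern and the $1 placeholder in the same way. This is a sketch in plain JavaScript and does not call the module. It also shows why the backslash is doubled when the expression is written as a string.

```javascript
const text = 'Heeeeeey duuuuude! I looooove The Lord of the Rings.';

// Written as a string, the back reference needs a doubled backslash...
const fromString = new RegExp('([aeiou])\\1+', 'g');
// ...while a regular expression literal takes it as is.
const fromLiteral = /([aeiou])\1+/g;
const samePattern = fromString.source === fromLiteral.source; // true

// \1 (back reference) is part of the search pattern;
// $1 (placeholder) belongs to the replacement string.
const cleaned = text.replace(fromString, '$1');
// cleaned === 'Hey dude! I love The Lord of the Rings.'
```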
The syntax of the method is:
moduleVariable.addREGEXBackref(regularExpression, replacementString)
where:
moduleVariable is the variable corresponding to the module and set with require().
regularExpression is the regular expression used to find the strings to replace.
replacementString is the replacement string, possibly containing capturing groups' placeholders.
The method returns true
in case of success, false
otherwise. In case of failure it sets an error message you can retrieve with the getLastError
method.
Placeholders
Placeholders that refer to capturing groups of the regular expression have the following syntax:
$capturingGroupNumber
where capturingGroupNumber is the number of the capturing group in the regular expression.
Capturing groups are numbered according to the position of their opening parenthesis, from left to right, so in case of nested capturing groups the inner groups have higher numbers. For example, in:
(super(man|woman|mouse))
the outermost group can be referenced with placeholder $1 and the innermost one with placeholder $2.
Placeholder $0 is special, since it corresponds to the text matched by the entire regular expression, which is not required to be enclosed in parentheses.
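Group numbering can be checked with a short plain JavaScript sketch, which does not call the module. JavaScript numbers nested groups the same way, but its replacement strings use $& rather than $0 for the whole match, so the whole match is read here from the match array instead.

```javascript
const match = 'supermouse'.match(/(super(man|woman|mouse))/);

const whole = match[0]; // text matched by the entire expression: 'supermouse'
const outer = match[1]; // group 1, the outermost group: 'supermouse'
const inner = match[2]; // group 2, the nested group: 'mouse'

// Placeholders can appear in any order in the replacement string.
const swapped = 'supermouse'.replace(/(super(man|woman|mouse))/, '$2-$1');
// swapped === 'mouse-supermouse'
```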
addREGEXSelect
Like addREGEX, but with an additional regular expression used to restrict the find & replace operation to selected areas of the text.
The parts of the text matching this regular expression are selected first, then the text to be replaced is searched within the selected parts.
Consider the following example:
regexcleaner.addREGEXSelect("(?is)song title:.*lyrics:", "(?is)\\R", " ");
and this input text:
Song title:
The
Long
And
Winding
Road
Lyrics:
The long and winding road
That leads to your door
Will never disappear
...
When the operation is applied you get:
Song title:
The Long And Winding Road
Lyrics:
The long and winding road
That leads to your door
Will never disappear
...
because line breaks are replaced with a space, but only in the song title area.
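The "select, then replace inside the selection" idea can be sketched in plain JavaScript as a replacement nested inside another one. The helper replaceWithin and the sample text are hypothetical and the module is not called. JavaScript has no (?is) prefix and no \R, so the i and s flags and \r?\n are used as rough equivalents.

```javascript
// Hypothetical helper, NOT the regexcleaner module: replaces findRegex
// only inside the portions of text matched by selectRegex.
function replaceWithin(text, selectRegex, findRegex, replacement) {
  return text.replace(selectRegex, (selected) =>
    selected.replace(findRegex, replacement)
  );
}

const input = 'Title:\nLong\nRoad\nBody:\nline one\nline two';

// Select everything from "title:" to "body:", then turn the line breaks
// found in that selection into spaces.
const result = replaceWithin(input, /title:.*body:/gis, /\r?\n/g, ' ');
// result === 'Title: Long Road Body:\nline one\nline two'
```

In this sketch every line break inside the selected span is replaced, while the line breaks after "Body:" are left alone because they fall outside the selection.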
The syntax is:
moduleVariable.addREGEXSelect(regularExpression1, regularExpression2, replacementString)
where:
moduleVariable is the variable corresponding to the module and set with require().
regularExpression1 is a regular expression that spots the portions of text where the find & replace operation occurs.
regularExpression2 is the regular expression used to find the text to replace in the portions of text selected by regularExpression1.
replacementString is the replacement string.
The method returns true
in case of success, false
otherwise. In case of failure it sets an error message you can retrieve with the getLastError
method.
addREGEXSelectBackref
Like addREGEXSelect, with the difference that the replacement string can contain placeholders corresponding to capturing groups, as in addREGEXBackref.
For example, when the operation prepared with:
regexcleaner.addREGEXSelectBackref('^(.+!)', '([aeiou])\\1+', '$1');
is applied to this text:
Heeeeeey duuuuude! I looooove The Lord of the Rings.
you get:
Hey dude! I looooove The Lord of the Rings.
As you can see, the operation is limited to the first sentence, up to the exclamation mark.
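The same result can be obtained in plain JavaScript with a nested replacement. This sketch does not call the module. Judging from the example above, the $1 placeholder is assumed to refer to the capturing group of the second regular expression, not to the group of the selecting one.

```javascript
const text = 'Heeeeeey duuuuude! I looooove The Lord of the Rings.';

// Outer expression: selects the text up to the exclamation mark.
// Inner expression: collapses repeated vowels inside that selection only.
const result = text.replace(/^(.+!)/, (selected) =>
  selected.replace(/([aeiou])\1+/g, '$1')
);
// result === 'Hey dude! I looooove The Lord of the Rings.'
```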
The syntax is:
moduleVariable.addREGEXSelectBackref(regularExpression1, regularExpression2, replacementString)
where:
moduleVariable is the variable corresponding to the module and set with require().
regularExpression1 is a regular expression that spots the portions of text where the find & replace operation occurs.
regularExpression2 is the regular expression used to find the text to replace in the portions of text selected by regularExpression1.
replacementString is the replacement string, possibly containing capturing groups' placeholders.
The method returns true
in case of success, false
otherwise. In case of failure it sets an error message you can retrieve with the getLastError
method.
repairBrokenWords
The repairBrokenWords
method is designed to reconstruct broken words by reading a list of words from an external source and intelligently piecing them back together when they are fragmented by spaces or similar characters (line breaks are not considered in this evaluation).
For example, if you pass a list with the word LIABILITY
to this method, it can mend broken variations like L I A B I L I T Y
or L IAB ILIT Y
and restore them to their original form.
The syntax of the method is:
moduleVariable.repairBrokenWords(listPath[, caseInsensitiveFlag, charactersAcceptedBetweenLetters])
where:
moduleVariable is the variable corresponding to the module and set with require().
listPath specifies the path to the external list containing the words to repair, relative to the rules folder in your project. The list format is described below.
caseInsensitiveFlag (optional, defaults to false), when set to true, enables case-insensitive matching of the broken words.
charactersAcceptedBetweenLetters (optional) is an array containing the characters accepted between individual letters of the broken words. For example, [" ", "."] matches blank spaces and full stops. If omitted, only spaces are accepted between the letters.
The external list supports empty lines and comment lines that start with //
.
For example, when the operation is prepared with:
regexcleaner.repairBrokenWords("regexcleaner/broken_words.cl", true, [" ", "."]);
and the list (saved at the path rules/regexcleaner/broken_words.cl
) contains:
// Insurance-related words
Insured
Reinsured
Assured
if the input text includes the line:
NAME OF THE A SS UR E D
this will be corrected to
NAME OF THE ASSURED
It's important to note that the method only considers complete words, so a string like:
NAME OF THEAS SURE D
won't be adjusted, as "ASSURED" is not a complete word separated by boundaries.
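One way to picture this behavior is a regular expression built from each listed word, allowing the accepted characters between letters and requiring word boundaries at both ends. The plain JavaScript sketch below is only a guess at an equivalent technique, not the module's actual algorithm. The names parseWordList, escapeRegExp and makeRepairer are hypothetical.

```javascript
// Hypothetical re-creation in plain JavaScript, NOT the module's algorithm.

// Keep only real entries: skip empty lines and comment lines starting with //.
function parseWordList(listText) {
  return listText
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line !== '' && !line.startsWith('//'));
}

// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(value) {
  return value.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build a function that rejoins the listed words when their letters are
// separated by any of the accepted characters.
function makeRepairer(words, caseInsensitive = false, separators = [' ']) {
  const anySeparator = separators.map(escapeRegExp).join('|');
  const gap = '(?:' + anySeparator + ')*';
  const separatorRegex = new RegExp(anySeparator, 'g');
  const flags = caseInsensitive ? 'gi' : 'g';
  const wordRegexes = words.map((word) => {
    const letters = Array.from(word).map(escapeRegExp);
    // \b on both sides: only complete words are considered.
    return new RegExp('\\b' + letters.join(gap) + '\\b', flags);
  });
  return (text) =>
    wordRegexes.reduce(
      (current, regex) =>
        current.replace(regex, (broken) => broken.replace(separatorRegex, '')),
      text
    );
}

const list = '// Insurance-related words\nInsured\nReinsured\n\nAssured';
const repair = makeRepairer(parseWordList(list), true, [' ', '.']);

const repaired = repair('NAME OF THE A SS UR E D');
// repaired === 'NAME OF THE ASSURED'
const untouched = repair('NAME OF THEAS SURE D');
// untouched === 'NAME OF THEAS SURE D'
```

The second input is left unchanged because the letter A is glued to the preceding word, so no word boundary precedes it.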
load
The load
method prepares find & replace operations—similarly to what you can do by invoking "add" methods—using as its source a configuration file generated when importing a project created with a legacy edition of Studio.
As mentioned above, Studio uses Perl compatible regular expressions and no flag is set by default.
For example, the character . does not match line breaks. To make it match them, use the flag (?s).
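The effect of the "dot matches line breaks" option can be seen with JavaScript regular expressions, used here only as an illustration. JavaScript expresses the option as a trailing s flag, while Perl compatible expressions accept the inline form (?s) at the start of the pattern.

```javascript
const twoLines = 'first\nsecond';

// Without the option, "." stops at the line break.
const withoutDotAll = /first.second/.test(twoLines); // false

// With the option, "." also matches the line break.
const withDotAll = /first.second/s.test(twoLines); // true
```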
As with the "add" methods, you must then invoke the apply method to perform the prepared operations.
Warning
The use of the load
method is not required in cases other than those indicated above and the import procedure already generates the appropriate statements inside the main.jr
file, so there are basically no cases in which you have to write code that uses this method.
For example, when importing an old project Studio may generate this code:
var regexcleaner = require("modules/regexcleaner");

function initialize(cmdline) {
    // Stop the initialization if the configuration cannot be loaded.
    if (!regexcleaner.load('Config.xml')) {
        CONSOLE.error(regexcleaner.getLastError());
        return false;
    }
    return true;
}

function onPrepare(text) {
    text = regexcleaner.apply(text);
    return text;
}

function shutdown() {
    regexcleaner.close();
}
The syntax is:
moduleVariable.load(configFilePath)
where:
moduleVariable is the variable corresponding to the module and set with require().
configFilePath is the path of the configuration file generated when importing the old technology project.
The method returns true
in case of success, false
otherwise. In case of failure it sets an error message you can retrieve with the getLastError
method.
apply
The apply
method performs all the operations prepared with invocations of "add" methods or with the invocation of the load
method.
For example:
function onPrepare(text) {
return regexcleaner.apply(text);
}
The syntax is:
moduleVariable.apply(string)
where:
moduleVariable is the variable corresponding to the module and set with require().
string is the string on which to perform the find & replace operations.
The method returns the modified string.
getLastError
The getLastError method retrieves the message corresponding to the last error that occurred when an "add" method or the load method fails. Use it to display the error message.
For example:
function initialize(cmdline) {
    if (!regexcleaner.load('Config.xml')) {
        // Show why the configuration could not be loaded.
        CONSOLE.error(regexcleaner.getLastError());
        return false;
    }
    return true;
}
The syntax is:
moduleVariable.getLastError()
where moduleVariable is the variable corresponding to the module and set with require()
.
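Since the "add" methods also return false on failure, the same check can be applied to each of them. The sketch below runs outside Studio, so it uses a hypothetical stand-in (makeStubCleaner) that exposes only addREGEX and getLastError and validates patterns with JavaScript's RegExp. The helper initializeWith and its log parameter are also hypothetical. In a real main.jr the module variable and CONSOLE.error would be used instead.

```javascript
// Hypothetical stand-in for the module, so the sketch runs outside Studio.
function makeStubCleaner() {
  let lastError = '';
  return {
    addREGEX(regularExpression, replacementString) {
      try {
        new RegExp(regularExpression); // reject invalid patterns
        return true;
      } catch (e) {
        lastError = e.message;
        return false;
      }
    },
    getLastError() {
      return lastError;
    }
  };
}

// Prepare every rule and stop at the first failure, reporting its message.
function initializeWith(cleaner, rules, log) {
  for (const [pattern, replacement] of rules) {
    if (!cleaner.addREGEX(pattern, replacement)) {
      log(cleaner.getLastError());
      return false;
    }
  }
  return true;
}

const messages = [];
const log = (message) => messages.push(message);

const ok = initializeWith(makeStubCleaner(), [['sec[ry]ion', 'section']], log);
// ok === true, nothing logged

const failed = initializeWith(makeStubCleaner(), [['(unclosed', 'x']], log);
// failed === false, one message logged
```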
close
The close
method is used to free up the resources allocated by the regexcleaner module object.
It's not mandatory to invoke this method, but if you decide to do it, invoke it inside the shutdown
function.
For example:
function shutdown() {
regexcleaner.close();
}
The syntax is:
moduleVariable.close()
where moduleVariable
is the variable corresponding to the module and set with require()
.