regexcleaner

Overview

The regexcleaner module performs find & replace operations on strings, mostly based on regular expressions.

The module supports Perl compatible regular expressions, as do the REX object and the PATTERN attribute of the rules language.

The module has these methods:

addPLAIN
addREGEX
addREGEXBackref
addREGEXSelect
addREGEXSelectBackref
repairBrokenWords
apply
load
getLastError
close

When you install the regexcleaner module in your project, Studio modifies the main.jr file by inserting this statement at the beginning of the file:

var regexcleaner = require('modules/regexcleaner');

The statement above assigns an instance of the module to a variable so that you can use it in all event handling functions.

Preparation and execution

Use the addPLAIN, addREGEX, addREGEXBackref, addREGEXSelect and addREGEXSelectBackref methods to prepare the find & replace operations you need, then invoke the apply method to actually perform them.
The apply method performs the operations in the same order in which they were prepared with the "add" methods.
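
Because operations run in preparation order, swapping two "add" calls can change the result. The following plain JavaScript sketch illustrates the idea with a minimal stand-in queue. The makeCleaner function and its add method are hypothetical names used only for this illustration; they are not part of the regexcleaner module and say nothing about how the module is implemented.

```javascript
// Minimal stand-in for a queue of prepared operations (illustration only).
function makeCleaner() {
    var operations = [];
    return {
        // Queue a regex-based replacement; nothing is applied yet.
        add: function (pattern, replacement) {
            operations.push({ regex: new RegExp(pattern, 'g'), replacement: replacement });
        },
        // Run every queued operation: first prepared, first applied.
        apply: function (text) {
            return operations.reduce(function (current, op) {
                return current.replace(op.regex, op.replacement);
            }, text);
        }
    };
}

var first = makeCleaner();
first.add('sec[ry]ion', 'section');
first.add('section', 'chapter');

var second = makeCleaner();
second.add('section', 'chapter');
second.add('sec[ry]ion', 'section');

console.log(first.apply('That secrion was awesome.'));  // That chapter was awesome.
console.log(second.apply('That secrion was awesome.')); // That section was awesome.
```

In the first queue the typo is fixed before the second operation sees the text, so both replacements take effect; in the second queue the first operation finds nothing to replace.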

To pre-process the text before it is submitted to analysis, invoke the "add" methods in the initialize function, then invoke the apply method in the onPrepare function to apply the changes during document preparation. For example:

var regexcleaner = require('modules/regexcleaner');

function initialize(cmdline) {
    regexcleaner.addREGEX('sec[ry]ion', 'section');
    regexcleaner.addREGEXBackref('(u){2,}','$1');

    return true;
}

function onPrepare(text) {
    var newText = regexcleaner.apply(text);
    return newText;
}

You are nevertheless free to use the apply method at any other point of your code and with any other string.

The load method must be used in the initialize function, because that is the right place to initialize objects needed in other event handling functions.

The getLastError method retrieves the message for the last error that occurred when an "add" method or the load method failed, so it is typically used in the initialize function, right after the failing call.

The close method, if you use it, must be invoked in the shutdown function, because it frees the resources allocated by the regexcleaner module object.

addPLAIN

The addPLAIN method prepares a plain find & replace operation. It's the only operation not based upon regular expressions.

Note

Remember, this and all the other "add" methods don't actually perform the find & replace operation. All the prepared operations are triggered by the apply method.

For example, this statement:

regexcleaner.addPLAIN('bamd', 'band');

prepares an operation that replaces all occurrences of the string bamd with the string band.

The syntax is:

moduleVariable.addPLAIN(searchString, replacementString)

where:

moduleVariable is the variable corresponding to the module and set with require().
searchString is the string to replace.
replacementString is the replacement string.

The method returns true in case of success, false otherwise. In case of failure, it sets an error message you can retrieve with the getLastError method.
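
The practical difference between a plain operation and a regular expression one is that characters such as . have no special meaning in a plain search string. The sketch below shows this in plain JavaScript, assuming that addPLAIN treats the search string literally, as its description suggests. The replacePlain function is a hypothetical helper written for this illustration, not the module's code.

```javascript
// Literal (non-regex) replace-all: split on the search string,
// then join the pieces with the replacement string.
function replacePlain(text, searchString, replacementString) {
    if (searchString === '') {
        return text; // nothing to search for
    }
    return text.split(searchString).join(replacementString);
}

var sample = 'v1.0 and v1x0';

// Plain search: the dot is just a dot, so only "v1.0" is replaced.
console.log(replacePlain(sample, 'v1.0', 'v2.0')); // v2.0 and v1x0

// Regex search: the dot matches any character, so "v1x0" is replaced too.
console.log(sample.replace(/v1.0/g, 'v2.0'));      // v2.0 and v2.0
```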

addREGEX

The addREGEX method prepares a find & replace operation in which the strings to replace are found using a regular expression.

For example:

regexcleaner.addREGEX("sec[ry]ion", "section");

When applied to the following text:

That specific secrion of the movie was really awesome.

the operation fixes the typo.

The syntax is:

moduleVariable.addREGEX(regularExpression, replacementString)

where:

moduleVariable is the variable corresponding to the module and set with require().
regularExpression is the regular expression used to find the strings to replace.
replacementString is the replacement string.

The method returns true in case of success, false otherwise. In case of failure, it sets an error message you can retrieve with the getLastError method.
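
Since the "add" methods return false on failure, their result can be checked in the initialize function. The sketch below shows one possible way to do it. So that it runs outside Studio, it defines stand-ins for regexcleaner and CONSOLE; these stand-ins are assumptions made for the illustration (the real module may fail for different reasons and with different messages), and inside Studio you would use the real objects instead.

```javascript
// Stand-ins so the sketch runs outside Studio (assumptions, not the real module).
var lastError = '';
var regexcleaner = {
    addREGEX: function (pattern, replacement) {
        try {
            new RegExp(pattern); // only checks that the pattern compiles
            return true;
        } catch (e) {
            lastError = e.message;
            return false;
        }
    },
    getLastError: function () { return lastError; }
};
var CONSOLE = { error: function (message) { console.error(message); } };

function initialize(cmdline) {
    var ok = regexcleaner.addREGEX('sec[ry]ion', 'section') &&
             regexcleaner.addREGEX('sec[ry', 'section'); // malformed on purpose
    if (!ok) {
        // Report why the operation could not be prepared, then stop.
        CONSOLE.error(regexcleaner.getLastError());
        return false;
    }
    return true;
}

var initialized = initialize();
console.log(initialized); // false, because the second pattern is malformed
```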

addREGEXBackref

The addREGEXBackref method works like addREGEX, except that the replacement string can contain special placeholders that are dynamically replaced with the contents of the capturing groups of the regular expression.

Consider the following operation:

regexcleaner.addREGEXBackref("([aeiou])\\1+", "$1")

When applied to this text:

Heeeeeey duuuuude! I looooove The Lord of the Rings.

the operation removes all the excess vowels, giving:

Hey dude! I love The Lord of the Rings.

Note

Remember to escape each backslash (\) with another backslash when you write a regular expression as a string in your code. If you use an external XML file, this isn't necessary, because XML does not treat the backslash as an escape character.

The regular expression has a capturing group, ([aeiou]), that matches any vowel. It is the first and only group, so its number is 1. The back reference inside the regular expression (\1), whose backslash must be escaped with another backslash in the string, matches exactly what group 1 has already captured, that is, the same vowel, one or more times.
The result is a regular expression that matches repetitions of the same vowel.
The placeholder in the replacement string ($1) is dynamically replaced with the content of capturing group number 1, that is, with the first vowel of each repetition.
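
The same expression and placeholder can be tried in plain JavaScript, whose regular expressions happen to use the same syntax for this particular pattern. This is only a way to experiment with the pattern outside Studio, not a description of how the module works internally.

```javascript
var text = 'Heeeeeey duuuuude! I looooove The Lord of the Rings.';

// In a string literal the back reference \1 must be written as \\1.
var pattern = new RegExp('([aeiou])\\1+', 'g');

// $1 in the replacement is the placeholder for capturing group 1,
// so each run of identical vowels collapses to a single vowel.
var cleaned = text.replace(pattern, '$1');

console.log(cleaned); // Hey dude! I love The Lord of the Rings.
```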

Note

Do not confuse back references in the regular expression with placeholders in the replacement string. Both refer to capturing groups, but in different ways and for different purposes. The regular expression does not have to use back references and may not even have capturing groups, but if it has them, you can refer to them in the replacement string using placeholders. Furthermore, the replacement string is not a regular expression.
The name of the method may suggest that it requires back references in the regular expression, but that is not the case.

The syntax of the method is:

moduleVariable.addREGEXBackref(regularExpression, replacementString)

where:

moduleVariable is the variable corresponding to the module and set with require().
regularExpression is the regular expression used to find the strings to replace.
replacementString is the replacement string, which can contain placeholders for capturing groups.

The method returns true in case of success, false otherwise. In case of failure it sets an error message you can retrieve with the getLastError method.

Placeholders

Placeholders that reference capturing groups of the regular expression have the following syntax:

$capturingGroupNumber

where capturingGroupNumber is the number of the capturing group in the regular expression.

Capturing groups are numbered by the position of their opening parenthesis, from left to right, so with nested capturing groups the inner groups have higher numbers. For example, in:

(super(man|woman|mouse))

the outermost group can be referenced with placeholder $1 and the innermost one with placeholder $2.

Placeholder $0 is special since it corresponds to the capture of the entire regular expression, which is not required to be contained in parentheses.
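
Group numbering can be checked in plain JavaScript, where the result of a match is an array indexed by group number and index 0 holds the entire match. The text "I am superwoman" is an invented sample. Note one difference that applies to JavaScript only: in its own replacement strings the entire match is written $& rather than $0.

```javascript
var match = /I am (super(man|woman|mouse))/.exec('I am superwoman');

console.log(match[0]); // I am superwoman  (entire match, what $0 refers to)
console.log(match[1]); // superwoman       (outermost group, $1)
console.log(match[2]); // woman            (innermost group, $2)

// Placeholders can appear in any order in the replacement string.
var swapped = 'I am superwoman'.replace(/I am (super(man|woman|mouse))/, '$2 <- $1');
console.log(swapped);  // woman <- superwoman
```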

addREGEXSelect

The addREGEXSelect method works like addREGEX, but takes an additional regular expression that restricts the find & replace operation to selected areas of the text.
The parts of the text matching this regular expression are selected first, then the text to replace is searched only within the selected parts.

Consider the following example:

regexcleaner.addREGEXSelect("(?is)song title:.*lyrics:", "(?is)\\R", " ");

and this input text:

Song title:
The
Long
And
Winding
Road

Lyrics:
The long and winding road
That leads to your door
Will never disappear
...

When the operation is applied you get:

Song title:
The Long And Winding Road

Lyrics:
The long and winding road
That leads to your door
Will never disappear
...

because line breaks are replaced with a blank space, but only in the song title area.

The syntax is:

moduleVariable.addREGEXSelect(regularExpression1, regularExpression2, replacementString)

where:

moduleVariable is the variable corresponding to the module and set with require().
regularExpression1 is the regular expression that selects the portions of text where the find & replace operation occurs.
regularExpression2 is the regular expression used to find the text to replace in the portions of text selected by regularExpression1.
replacementString is the replacement string.

The method returns true in case of success, false otherwise. In case of failure it sets an error message you can retrieve with the getLastError method.
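
The "select, then replace inside the selection" logic can be sketched in plain JavaScript as a replacement nested inside another one. The replaceWithin function is a hypothetical helper written for this illustration and is not the module's implementation. The sample uses a simpler text than the song title example, because JavaScript regular expressions do not support \R.

```javascript
// Replace findPattern with replacementString, but only inside the
// portions of text that match selectPattern.
function replaceWithin(text, selectPattern, findPattern, replacementString) {
    var selectRegex = new RegExp(selectPattern, 'g');
    var findRegex = new RegExp(findPattern, 'g');
    return text.replace(selectRegex, function (area) {
        // Function form, so that "$" in replacementString is taken literally.
        return area.replace(findRegex, function () {
            return replacementString;
        });
    });
}

// Remove hyphens, but only in the "id:" field (up to the semicolon).
var result = replaceWithin('id: A-1-2; note: B-3-4', 'id:[^;]*', '-', '');
console.log(result); // id: A12; note: B-3-4
```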

addREGEXSelectBackref

The addREGEXSelectBackref method works like addREGEXSelect, except that the replacement string can contain placeholders corresponding to capturing groups, as in addREGEXBackref.

For example, when the operation prepared with:

regexcleaner.addREGEXSelectBackref('^(.+!)', '([aeiou])\\1+', '$1');

is applied to this text:

Heeeeeey duuuuude! I looooove The Lord of the Rings.

you get:

Hey dude! I looooove The Lord of the Rings.

As you can see, the operation is limited to the first sentence, up to the exclamation mark.

The syntax is:

moduleVariable.addREGEXSelectBackref(regularExpression1, regularExpression2, replacementString)

where:

moduleVariable is the variable corresponding to the module and set with require().
regularExpression1 is the regular expression that selects the portions of text where the find & replace operation occurs.
regularExpression2 is the regular expression used to find the text to replace in the portions of text selected by regularExpression1.
replacementString is the replacement string, which can contain placeholders for capturing groups.

The method returns true in case of success, false otherwise. In case of failure it sets an error message you can retrieve with the getLastError method.
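
The example above can be reproduced in plain JavaScript with a nested replacement, as a way to experiment with the two expressions outside Studio. This is a sketch of the observable behavior only, not of the module's internals.

```javascript
var text = 'Heeeeeey duuuuude! I looooove The Lord of the Rings.';

// Outer expression: select the area from the start up to the exclamation mark.
var result = text.replace(/^(.+!)/, function (area) {
    // Inner expression: collapse repeated vowels, only inside the selected area.
    return area.replace(/([aeiou])\1+/g, '$1');
});

console.log(result); // Hey dude! I looooove The Lord of the Rings.
```

The placeholder $1 refers to the capturing group of the inner expression (the one that finds the text to replace), not to the group of the selecting expression.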

repairBrokenWords

The repairBrokenWords method is designed to reconstruct broken words by reading a list of words from an external source and intelligently piecing them back together when they are fragmented by spaces or similar characters (line breaks are not considered in this evaluation).

For example, if you pass a list with the word LIABILITY to this method, it can mend broken variations like L I A B I L I T Y or L IAB ILIT Y and restore them to their original form.

The syntax of the method is:

moduleVariable.repairBrokenWords(listPath[, caseInsensitiveFlag, charactersAcceptedBetweenLetters])

where:

moduleVariable is the variable corresponding to the module and set with require().
listPath specifies the path to the external list containing the words to repair, relative to the rules folder of your project. The list format is described below.
caseInsensitiveFlag (optional, defaults to false), when set to true, enables case-insensitive matching of the broken words.
charactersAcceptedBetweenLetters (optional) is an array containing the characters accepted between individual letters of the broken words. For example, [" ", "."] accepts blank spaces and full stops. If omitted, only spaces are accepted between the letters.

The external list supports empty lines and comment lines that start with //.

For example, when the operation is prepared with:

regexcleaner.repairBrokenWords("regexcleaner/broken_words.cl", true, [" ", "."]);

and the list (saved at the path rules/regexcleaner/broken_words.cl) contains:

// Insurance-related words
Insured
Reinsured

Assured

if the input text includes the line:

NAME OF THE A SS UR E D

it is corrected to:

NAME OF THE ASSURED

It's important to note that the method only considers complete words, so a string like:

NAME OF THEAS SURE D

won't be adjusted, as "ASSURED" is not a complete word separated by boundaries.
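
One possible way to obtain this behavior is to turn each word of the list into a regular expression that accepts the allowed characters between letters and requires word boundaries at both ends. The plain JavaScript sketch below does exactly that. It is an assumption about how such a repair could work, not the module's actual algorithm, and parseWordList, escapeLiteral, escapeInClass and repairWithList are hypothetical names.

```javascript
// Read the list: one word per line, skipping empty lines and // comments.
function parseWordList(listText) {
    return listText.split(/\r?\n/)
        .map(function (line) { return line.trim(); })
        .filter(function (line) { return line !== '' && line.indexOf('//') !== 0; });
}

// Escape characters that are special in a regex or inside a character class.
function escapeLiteral(s) { return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'); }
function escapeInClass(s) { return s.replace(/[\\\]^-]/g, '\\$&'); }

function repairWithList(text, words, caseInsensitive, separators) {
    // Character class of the characters accepted between letters.
    var cls = '[' + (separators || [' ']).map(escapeInClass).join('') + ']';
    var separatorRegex = new RegExp(cls, 'g');
    return words.reduce(function (current, word) {
        // "Assured" becomes \bA[ .]*s[ .]*s[ .]*u[ .]*r[ .]*e[ .]*d\b
        var letters = word.split('').map(escapeLiteral).join(cls + '*');
        var wordRegex = new RegExp('\\b' + letters + '\\b', caseInsensitive ? 'gi' : 'g');
        // Keep the letters found in the text and drop the separators.
        return current.replace(wordRegex, function (match) {
            return match.replace(separatorRegex, '');
        });
    }, text);
}

var words = parseWordList('// Insurance-related words\nInsured\nReinsured\n\nAssured\n');
console.log(repairWithList('NAME OF THE A SS UR E D', words, true, [' ', '.']));
// NAME OF THE ASSURED
console.log(repairWithList('NAME OF THEAS SURE D', words, true, [' ', '.']));
// NAME OF THEAS SURE D
```

In the second call nothing changes because the A of THEAS is preceded by a letter, so no word boundary is found before it.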

load

The load method prepares find & replace operations, much as the "add" methods do, using as its source a configuration file generated when importing a project created with a legacy edition of Studio.

As mentioned above, Studio uses Perl compatible regular expressions and no flag is set by default.

For example, the character . does not match line breaks. To make it match them, use the flag (?s).
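
The default behavior of the dot can be observed in plain JavaScript, whose regular expressions follow the same convention. Note that JavaScript expresses the "dot matches line breaks" option as a trailing s flag on the expression rather than as an inline flag, so the snippet illustrates the behavior, not the module's syntax.

```javascript
var twoLines = 'first\nsecond';

// By default the dot does not match the line break between the two words.
var matchesByDefault = /first.second/.test(twoLines);
console.log(matchesByDefault); // false

// With the "dot matches all" option enabled, it does.
var matchesWithDotAll = /first.second/s.test(twoLines);
console.log(matchesWithDotAll); // true
```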

As with the "add" methods, you must then invoke the apply method to perform the prepared operations.

Warning

The load method is needed only in the case described above, and the import procedure already generates the appropriate statements in the main.jr file, so in practice you should never have to write code that uses this method.

For example, when importing an old project Studio may generate this code:

var regexcleaner = require("modules/regexcleaner");

function initialize(cmdline) {
    // Stop the initialization if the configuration file cannot be loaded.
    if (!regexcleaner.load('Config.xml')) {
        CONSOLE.error(regexcleaner.getLastError());
        return false;
    }
    return true;
}

function onPrepare(text) {
    text = regexcleaner.apply(text);
    return text;
}

function shutdown() {
    regexcleaner.close();
}

The syntax is:

moduleVariable.load(configFilePath)

where:

moduleVariable is the variable corresponding to the module and set with require().
configFilePath is the path of the configuration file generated when importing the legacy project.

The method returns true in case of success, false otherwise. In case of failure it sets an error message you can retrieve with the getLastError method.

apply

The apply method performs all the operations prepared with invocations of "add" methods or with the invocation of the load method.

For example:

function onPrepare(text) {
    return regexcleaner.apply(text);
}

The syntax is:

moduleVariable.apply(string)

where:

moduleVariable is the variable corresponding to the module and set with require().
string is the string on which to perform the find & replace operations.

The method returns the modified string.

getLastError

The getLastError method retrieves the message corresponding to the last error that occurred when an "add" method or the load method failed. Use it to display the error message.

For example:

function initialize(cmdline) {
    // Display the error and stop if the configuration file cannot be loaded.
    if (!regexcleaner.load('Config.xml')) {
        CONSOLE.error(regexcleaner.getLastError());
        return false;
    }
    return true;
}

The syntax is:

moduleVariable.getLastError()

where moduleVariable is the variable corresponding to the module and set with require().

close

The close method is used to free up the resources allocated by the regexcleaner module object.
It's not mandatory to invoke this method, but if you decide to do it, invoke it inside the shutdown function.

For example:

function shutdown() {
    regexcleaner.close();
}

The syntax is:

moduleVariable.close()

where moduleVariable is the variable corresponding to the module and set with require().