onPrepare

Introduction

Scripting can be used to pre-process the text before it is submitted to analysis.

Possible use cases for text pre-processing are:

  • Removing wrong or repeated punctuation marks.
  • Removing unnecessary white space (blanks, newlines).
  • Correcting systematic OCR errors.
  • Converting alphabetic characters to upper case or lower case.
  • Removing special or unwanted characters and words, for example HTML tags.
  • Converting number words to numeric form and vice versa.
  • Converting emoticons to words.
  • Converting emojis to words.
  • Converting chat words.

Text pre-processing must take place in the onPrepare event handler function, which is executed when the Prepare event is fired, that is, after document preparation and immediately before the text is submitted to analysis.

The text argument of the onPrepare function contains the original text. Pre-processing consists of changing it when appropriate and returning the modified value. The subsequent analysis phase takes the value returned by the function as its input text.

function onPrepare(text) {

    // Put here statements that change the value of text when appropriate

    return text;
}
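
As a minimal sketch of this pattern, the handler below applies two of the use cases listed above: it collapses runs of white space and reduces repeated exclamation or question marks to a single mark. The two clean-up rules are illustrative assumptions, not rules prescribed by the product.

```javascript
function onPrepare(text) {
    // Illustrative rule: collapse runs of blanks, tabs and newlines into one space.
    text = text.replace(/\s+/g, " ");

    // Illustrative rule: reduce repeated ! or ? marks to a single mark.
    text = text.replace(/([!?])\1+/g, "$1");

    return text;
}
```

With these rules, the input Hello   world!!! followed by a newline and Bye?? would be returned as Hello world! Bye?.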

Find more information on document preparation and the differences between run time environment and Studio in the Studio user manual article about the topic.

Simple string manipulation

For simple text manipulation operations, you can use the properties and methods that all string objects have according to the reference standard specification. Below are examples of the most commonly used features.

Note

The position of characters and sub-strings in a string is zero based, meaning that the position of the first character in a string is 0, that of the second character is 1 and so on. This way, for example, the position of the last character in a string is the length of the string minus 1.

Determine string length

Every string object has a length property, which holds the number of characters in the string.

For example, if the text argument of the onPrepare function is set to Hello world!, after this statement:

var inputTextLength = text.length;

the value of variable inputTextLength will be 12.

If you declare the variable inside the function, it is not available to other functions.

To make the variable available to other functions, declare it globally, like this:

var inputTextLength = 0;

then assign the correct value inside the function:

    inputTextLength = text.length;
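
Put together, the two fragments above might look like the following sketch. The isLongText function and its threshold of 10 are hypothetical and serve only to show a second function reading the global variable.

```javascript
// Declared globally so that every function in the script can read it.
var inputTextLength = 0;

function onPrepare(text) {
    // Assign the value inside the handler, without redeclaring the variable.
    inputTextLength = text.length;
    return text;
}

// Hypothetical helper, for illustration only: reads the global variable.
function isLongText() {
    return inputTextLength > 10;
}
```

If onPrepare receives Hello world!, inputTextLength is 12 afterwards and isLongText() returns true.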

Extract a character

To extract the character at a given position in a string, use the charAt() method, which takes the character position within the string as its argument.

For example, if the text argument of the onPrepare function is set to Hello world!, after this statement:

var seventhChar = text.charAt(6);

the value of variable seventhChar will be w, which is the seventh character of the text because positions are zero based.
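
The short sketch below combines charAt() with the length property to read the first and last characters of a string. The variable names are illustrative, and the sample value stands in for the text argument.

```javascript
var sampleText = "Hello world!";

// Position 0 is the first character.
var firstChar = sampleText.charAt(0);

// The last character is at position length - 1.
var lastChar = sampleText.charAt(sampleText.length - 1);

// A position outside the string yields an empty string.
var outOfRange = sampleText.charAt(sampleText.length);
```

Here firstChar is H, lastChar is ! and outOfRange is an empty string.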

Find a sub-string

To find a sub-string within a string, use the indexOf() method, which takes the sub-string to find as its argument.

For example, if the text argument of the onPrepare function is set to Hello world!, after this statement:

var mySubStrPos = text.indexOf("world");

the value of variable mySubStrPos will be 6, which is the position of world inside the value of text.

If the sub-string is not found, the method returns -1.
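
A common way to use the -1 result is to test whether a sub-string is present at all. The containsSubString helper below is a hypothetical name used for illustration. Note that the search is case sensitive.

```javascript
// Hypothetical helper: true when subString occurs somewhere in text.
function containsSubString(text, subString) {
    return text.indexOf(subString) !== -1;
}

var hasWorld = containsSubString("Hello world!", "world");  // found at position 6
var hasMoon = containsSubString("Hello world!", "moon");    // not found
var hasUpper = containsSubString("Hello world!", "World");  // not found: case differs
```

In this sketch hasWorld is true, while hasMoon and hasUpper are both false.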

Replace a sub-string

To replace a sub-string, use the replace() method.

For example, if the text argument of the onPrepare function is set to Hello world!, after this statement:

var newText = text.replace("world", "moon");

the value of variable newText will be Hello moon!.

Note

The replace() method only replaces the first occurrence of the sub-string.

Warning

  • You can use a regular expression instead of a string as the first argument of replace().
  • In a regular expression, characters with a special meaning (for example .) must be escaped with a backslash (\).
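
The following sketch contrasts a string argument with a regular expression that carries the g (global) flag, which replaces every occurrence. It also shows escaping the dot. The sample text and variable names are illustrative.

```javascript
var sampleText = "Hello world! Goodbye world...";

// String argument: only the first occurrence is replaced.
var firstOnly = sampleText.replace("world", "moon");

// Regular expression with the g flag: every occurrence is replaced.
var allOccurrences = sampleText.replace(/world/g, "moon");

// Each dot is escaped so that it matches a literal dot, not any character.
var noEllipsis = sampleText.replace(/\.\.\./g, ".");
```

Here firstOnly is Hello moon! Goodbye world..., allOccurrences is Hello moon! Goodbye moon... and noEllipsis is Hello world! Goodbye world.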

Change case

To change the case of a string to lowercase, use the toLowerCase() method.

For example, if the text argument of the onPrepare function is set to Hello world!, after this statement:

var myLowTxt = text.toLowerCase();

the value of variable myLowTxt will be hello world!.

Similarly, to change the case to uppercase, use the toUpperCase() method.
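
As a brief sketch, the statements below show toUpperCase() and one possible use of case conversion: lower-casing the text before calling indexOf() to make a search case insensitive. The variable names are illustrative.

```javascript
var sampleText = "Hello World!";

var myUpTxt = sampleText.toUpperCase();

// indexOf() is case sensitive, so this search fails on the original text...
var exactPos = sampleText.indexOf("world");

// ...but succeeds once the text has been converted to lowercase.
var relaxedPos = sampleText.toLowerCase().indexOf("world");
```

Here myUpTxt is HELLO WORLD!, exactPos is -1 and relaxedPos is 6. The original string is never modified, because both methods return a new string.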