Skip to content

normalizepost

Overview

normalizepost is a scripting module providing transformation and normalization features for extraction records.

The available methods for this module are:

  • FIELD_NORM_STRING
  • FIELD_NORM_NUMERIC
  • RENEXTRA
  • REPLACEFIELD
  • SPLIT_NORM_LIST_REPLACE
  • SPLIT_NORM_LIST_REPLACE_REGEX
  • load
  • apply
  • getLastError
  • close

When in Studio you install the normalizepost module in your project, Studio modifies the main.jr file to insert this statement at the beginning of the file:

var normalizepost = require("modules/normalizepost");

The statement above sets a variable with an instance of the module so that you can use it inside event handling functions.

All methods except for load, getLastError and close must be used in the onFinalize function, because they act on the analysis results available when this function is run.

The load method must be used in the initialize function, because it is the right place for the initialization of objects needed in other event handling functions.

The getLastError method must be used in the initialize function, because it retrieves the message corresponding to the last error that occurred when the load method fails.

The close method must be used in the shutdown function, because it is used to free up the resources allocated by the normalizepost module object.

FIELD_NORM_STRING

Purpose and syntax

FIELD_NORM_STRING normalizes fields values in various ways.

The syntax is:

moduleVariable.FIELD_NORM_STRING(result, templateName, fieldName, normalizationType, normalizationRules)

where:

  • moduleVariable is the variable corresponding to the module and set with require().
  • result is the object containing the analysis results.
  • templateName is the template name of the records to act upon. It can be an asterisk (*), which means "any template".
  • fieldName is the field name.
  • normalizationType is the normalization type. It's a string that can be:

    • CASE
    • BOOL
    • REPLACE
    • REPLACE_REGEX
    • REPLACE_REGEX_DEBUG_MODE

    Normalization types are described below.

  • normalizationRules is the specification of the normalization and its possible values vary based on the value of normalizationType (see below).

Normalization types and formats

CASE

CASE normalization applies a specific letter case to field values.
Parameter normalizationRules must be a string with one of these values:

  • UCASE: all words are upper-cased
  • LCASE: all words are lower-cased
  • TCASE: simplified title case, all words are capitalized

For example, this code:

function onFinalize(result) {
    normalizepost.FIELD_NORM_STRING(result, "PERSONAL_DATA", "NAME", "CASE", "TCASE");
    return result;
}

capitalizes all the words in the value of each occurrence of field NAME in all the PRESONAL_DATA template records.

BOOL

BOOL normalization replaces values like yes and no with alternative text.
If the alternative text is empty, the field is deleted. If, as consequence of the replacement with empty strings, all of a record's fields are deleted, the entire record is deleted too.
Parameter normalizationRules must be a string with this syntax:

alternativeForYes\alternativeForNo

where alternativeForYes is the replacement for "yes" values and alternativeForNo is the replacement for "no" values.

For example, this code:

function onFinalize(result) {
    normalizepost.FIELD_NORM_STRING(result, "*", "Answer", "BOOL", "TRUE\\FALSE");
    return result;
}

changes all "yes" values to TRUE and all "no" values to FALSE for all the instances of field Answer occurring in any record of any template.

The method recognizes "yes" and "no" values written in English, Spanish, French, German and Italian, no matter which the language of the analysis project is.

REPLACE

REPLACE normalization replaces a given field value with another.

If the replacement text is empty, the field is deleted. If, as consequence of replacement with empty strings, all of a record's fields are deleted, the entire record is deleted too.

Parameter normalizationRules can be:

  • An object with this structure:

    {
        valueToReplace: newValue
    }
    

    where valueToReplace is the value to replace. Its value must be in lower case and the match between it and the fields' values is case insensitive and newValue is the replacement value.

Or:

For example, this code:

function onFinalize(result) {
    normalizepost.FIELD_NORM_STRING(result, "PERSONAL_DATA", "Job", "REPLACE", {"software engineer": "programmer"});
    return result;
}

replaces with programmer all the occurrences of software engineer, regardless of the letter case, for the Job field in all the PERSONAL_DATA template records.

REPLACE_REGEX

REPLACE_REGEX normalization replaces any occurrence of a given JavaScript regular expression with an alternative text which can possibly contain reference to capturing groups like $1, $2, etc.

If the replacement text is empty, the field is deleted. If, as consequence of replacement with empty strings, all of a record's fields are deleted, the entire record is deleted too.

Parameter normalizationRules can be:

  • An array of objects object with this structure:

    {
        regexp: regularExpression,
        value: replacementText
    }
    

    where regularExpression is the JavaScript regular expression used to find the text to replace and replacementText is the replacement text.

Or:

For example, this code:

function onFinalize(result) {
    normalizepost.FIELD_NORM_STRING(result, "CRIMINAL", "FaceFeature", "REPLACE_REGEX", [{"regexp": /(?:great|big|large) (.*)/, "value": "$1: large"}, {"regexp": /(?:small|little|tiny|) (.*)/, "value": "$1: small"}]);
    return result;
} 

replaces expressions like big nose or small ears with nose: large and ears: small in the FaceFeature field values of CRIMINAL template records.

REPLACE_REGEX_DEBUG_MODE

REPLACE_REGEX_DEBUG_MODE acts like REPLACE_REGEX but it also activates a debug mode to check if your regular expressions are working.

This code:

function onFinalize(result) {
    normalizepost.FIELD_NORM_STRING(result, "CRIMINAL", "FaceFeature", "REPLACE_REGEX_DEBUG_MODE", [{"regexp": /(?:great|big|large) (.*)/, "value": "$1: large"}, {"regexp": /(?:small|little|tiny|) (.*)/, "value": "$1: small"}]);
    return result;
} 

replaces expressions like big nose or small ears with nose: large and ears: small in the FaceFeature field values of CRIMINAL template records. You can see notifications about your regular expressions from the Console tool window:

You can also activate this mode using REPLACE_REGEX but adding a string after the normalizationRules parameter, like this:

function onFinalize(result) {
    normalizepost.FIELD_NORM_STRING(result, "CRIMINAL", "FaceFeature", "REPLACE_REGEX", [{"regexp": /(?:great|big|large) (.*)/, "value": "$1: large"}, {"regexp": /(?:small|little|tiny|) (.*)/, "value": "$1: small"}], "debug mode");
    return result;
}

This string can be:

  • true or debug mode: specify all strings transformed by the regular expressions.
  • verbose debug mode: specify all strings transformed and untransformed by the regular expressions.

FIELD_NORM_NUMERIC

The FIELD_NORM_NUMERIC method converts numbers written in words into numbers expressed in digits, possibly applying a multiplication factor to the numbers obtained.

Consider for example this template:

TEMPLATE(DISTANCE)
{
    @KILOMETERS,
    @METERS
}

If the following rule:

SCOPE SENTENCE
{
    IDENTIFY(DISTANCE)
    {
        LEMMA("distance")
        <1:4>
        KEYWORD("a")
        <1:4>
        KEYWORD("b")
        <1:3>
        LEMMA("be")
        <1:3>
        @METERS[KEYWORD("five thousand")]
    }
}

is applied to this input text

The distance from  A to B is five thousand m.

with this code (see also the FIELD_CLONE method of mergepost):

function onFinalize(result) {
    mergepost.FIELD_CLONE(result, "DISTANCE", "METERS", "KILOMETERS");
    return result
}

you get:

If you change the code like this:

function onFinalize(result) {
    mergepost.FIELD_CLONE(result, "DISTANCE", "METERS", "KILOMETERS");
    normalizepost.FIELD_NORM_NUMERIC(result, "EN", "DISTANCE", "KILOMETERS", "*m");
    return result
}

and apply the rule above to the same input text, you get:

The number in words has been converted to digits, then the multiplying factor *m, where the "m" stands for "milli", corresponding to x 10^-3, is applied.

If the field value is already a number expressed with digits, no conversion takes place, but the number is recognized as such and the possible multiplication factor is applied.
So, in the case of the example above, if the initial value of field KILOMETERS had been 5000, it would have become 5 all the same.

The syntax is:

moduleVariable.FIELD_NORM_NUMERIC(result, lang, templateName, fieldName, adapt)

where:

  • moduleVariable is the variable corresponding to the module and set with require().
  • result is the object containing the analysis results.
  • lang refers to the language in which numbers in words are written. Possible values are:
    • EN for English
    • IT for Italian
    • ES for Spanish
    • DE for German
    • NL for Dutch
  • templateName is the template name of the records to act upon. It can be an asterisk (*), which means "any template".
  • fieldName is the field name.
  • adapt is a multiplication factor. It can be an empty string, in which case the numeric value is not altered, or it can be one the following:

    Value Multiplication factor
    *p (pico) x 10-12
    *n (nano) x 10-9
    *u (micro) x 10-6
    *m (milli) x 10-3
    *K (kilo) x 103
    *M (mega) x 106
    *G (giga) x 109
    *T (tera) x 1012

RENEXTRA

The RENEXTRA method renames records' templates and fields.

Consider for example this template:

TEMPLATE(PERSONAL_DATA)
{
    @NAME,
    @ADDRESS,
    @JOB,
    @AGE
}

If the following rule:

SCOPE SENTENCE
{
    IDENTIFY(PERSONAL_DATA)
    {
        @NAME[TYPE(NPH)]
        <>
        @AGE[PATTERN("[1-9][0-9]")]
        <>
        @JOB[LEMMA("technical writer")]
    }
}

is applied to this input text:

Christine is 42 years old and works as a technical writer.

you get this output:

With this code:

function onFinalize(result) {
    normalizepost.RENEXTRA(result, "PERSONAL_DATA", "PERSONAL_INFORMATION", null);
    return result;
}

if the rule above is applied to the same input text, you get:

The template name for PERSONAL_DATA records has changed to PERSONAL_INFORMATION

If the code was:

function onFinalize(result) {
    normalizepost.RENEXTRA(result, "PERSONAL_DATA", null, [{name: "NAME", new: "PROPER_NAME"}, {name: "AGE", new: "YEARS_OF_AGE"}]);
    return result;
}

the output would be:

In this case, fields have been renamed and the old names have become the values of the fields themselves.

Note

Odd as may seem, this behavior is by design. In fact, this method replicates the behavior of a post-processor that was available in the legacy technology of which Studio represents the evolution and is meant to be used for backward compatibility when importing old projects. For an alternative way to rename records' templates and fields, consider the JsonPlug module.

RENEXTRA is typically used in combination with REPLACEFIELD (see below).

The syntax is:

moduleVariable.RENEXTRA(result, templateName, newTemplateName, renameRules)

where:

  • moduleVariable is the variable corresponding to the module and set with require().
  • result is the object containing the analysis results.
  • templateName is the template name of the records to act upon. It can be an asterisk (*), which means "any template".
  • newTemplateName is the new template name that will replace the old one. If null, the template name is not changed, so use null if you only want to rename fields.
  • renameRules is an array containing objects each representing a field rename rule and having the following properties:

    • name: old field name.
    • new: new field name.

    If null, fields are not renamed, so use null if you only want to change the template name.

REPLACEFIELD

The REPLACEFIELD method changes all the matches of a regular expression inside the values of all the fields that have been renamed using the RENEXTRA method.

Consider this template:

TEMPLATE(PERSONAL_DATA)
{
    @Name,
    @Job,
    @Product,
    @Company,
    @Role
}

If the following rule:

SCOPE SENTENCE
{
    IDENTIFY(PERSONAL_DATA)
    {
        @Name[TYPE(NPH)]
        <1:5>
        @Job[LEMMA("software engineer")]
        <1:4>
        @Company[TYPE(COM)]
    }
}

is applied to this input text:

George Dickinson is a software engineer for Acme Ltd.

you will normally get:

With this code:

function onFinalize(result) {
   var renameRules = [{name: "Name", new: "PseudoName"}]
   normalizepost.RENEXTRA(result, "PERSONAL_DATA", null, renameRules);  
   normalizepost.REPLACEFIELD(result, "Name", "John Doe"); 
   return result;
}

it becomes:

The syntax is:

moduleVariable.REPLACEFIELD(result, regularExpression, replacementString)

where:

  • moduleVariable is the variable corresponding to the module and set with require().
  • result is the object containing the analysis results.
  • regularExpression is a string containing a regular expression used to find the parts of fields' value to change.
  • replacementString is the string that replaces all the matches of regularexpression in fields' values.

SPLIT_NORM_LIST_REPLACE

Use the SPLIT_NORM_LIST_REPLACE to set an array of replacement objects starting from a list file. This array can be used as value of the normalizationRules parameter of the FIELD_NORM_STRING method when the value of its normalizationType parameter is REPLACE.

For example, consider this template:

TEMPLATE(PERSONAL_DATA)
{
    @Name,
    @Address,
    @Phone,
    @Job
}

If this rule:

SCOPE SENTENCE
{
    IDENTIFY(PERSONAL_DATA)
    {
        @Name[TYPE(NPH)]
        <>
        @Job[LEMMA("software engineer")]
    }
}

is applied to this text:

John is a software engineer.

you will get:

With this code:

function onFinalize(result) {
    var replacementList = normalizepost.SPLIT_NORM_LIST_REPLACE("replacements.cl")
    normalizepost.FIELD_NORM_STRING(result, "PERSONAL_DATA", "Job", "REPLACE", replacementList);
    return result;
}

and if the replacements.cl list file has these contents:

programmer=software engineer|developer|dev

you will get:

The syntax of SPLIT_NORM_LIST_REPLACE is:

moduleVariable.SPLIT_NORM_LIST_REPLACE(listFileName)

where:

  • moduleVariable is the variable corresponding to the module and set with require().
  • listFileName is the list file name. List files must be placed in a folder called normalizepost, which in turn must be located either in the rules or in the modules folder of the project.

Note

In case you have two sub-folders in both locations and the same list file in both sub-folders, priority is given to that under the rules folder.

Each line of the list file must have this syntax:

newValue=valueToReplace_1[|valueToReplace_2...|valueToReplace_n]

where:

  • newValue is the new value.
  • valueToReplace_n is a value to replace.

If newValue is empty, the field is deleted. If, as consequence of replacement with empty strings, all of a record's fields are deleted, the entire record is deleted too.

SPLIT_NORM_LIST_REPLACE_REGEX

Like SPLIT_NORM_LIST_REPLACE with the difference that list files are compiled with JavaScript regular expressions. The array can be used as value of the normalizationRules parameter of the FIELD_NORM_STRING method when the value of its normalizationType parameter is REPLACE_REGEX.

For example, consider this template:

TEMPLATE(PERSONAL_DATA)
{
    @Name,
    @Address,
    @Phone,
    @Job
}

If this rule:

SCOPE SENTENCE
{
    IDENTIFY(PERSONAL_DATA)
    {
        @Name[TYPE(NPH)]
        <>
        @Job[LEMMA("software developer")]
    }
}

is applied to this text:

John is a software developer.

you will get:

With this code:

function onFinalize(result) {
    var replacementList = normalizepost.SPLIT_NORM_LIST_REPLACE_REGEX("replacements.cl")
    normalizepost.FIELD_NORM_STRING(result, "PERSONAL_DATA", "Job", "REPLACE_REGEX", replacementList);
    return result;
}

and if the replacements.cl list file has these contents:

$2 programmer=\b((software) developer)\b/i

you will get:

The syntax of SPLIT_NORM_LIST_REPLACE_REGEX is:

moduleVariable.SPLIT_NORM_LIST_REPLACE_REGEX(listFileName)

where:

  • moduleVariable is the variable corresponding to the module and set with require().
  • listFileName is the list file name. List files must be placed in a folder called normalizepost, which in turn must be located either in the rules or in the modules folder of the project.

Note

In case you have two sub-folders in both locations and the same list file in both sub-folders, priority is given to that under the rules folder.

Each line of the list file must have this syntax:

replacementString=regularExpression

where:

  • replacementString is the replacement string which can possibly contain reference to capturing groups like $1, $2, etc.
  • regularExpression is the JavaScript regular expression.

If replacementString is empty, the field is deleted. If, as consequence of replacement with empty strings, all of a record's fields are deleted, the entire record is deleted too.

load

The load method prepares one or more of the operations that can be attained with the methods above, but using as its source a configuration file generated when importing a project created with a legacy edition of Studio. Prepared operations are then applied using the apply method.

Warning

The use of the load method is not required in cases other than that indicated above and the import procedure already generates the appropriate statements inside the main.jr file, so there are basically no cases in which you have to write code that uses this method.

For example, when importing an old project, Studio may generate this code:

function initialize(cmdline) {
    if (!normalizepost.load('Config.xml')) {
        CONSOLE.error(normalizepost.getLastError());
        return false;
    }
    return true;
}

function onFinalize(result) {
    result = normalizepost.apply(result);
    return result;
}

Its syntax is:

moduleVariable.load(configPath)

where:

  • moduleVariable is the variable corresponding to the module and set with require().
  • configPath is the config.xml file path generated by the import procedure.

The method returns true in case of success, false otherwise. In case of failure it sets an error message you can retrieve with the getLastError method.

apply

The apply method performs all the operations prepared with the invocation of the load method.

For example:

function onFinalize(result) {
    result = normalizepost.apply(result);
    return result;
}

The syntax is:

moduleVariable.apply(result)

where:

  • moduleVariable is the variable corresponding to the module and set with require().
  • result is the object containing the analysis results.

getLastError

The getLastError method retrieves the message corresponding to the last error that occurred when theload method fails. Use it to display the error message.

For example:

function initialize(cmdline) {
    if (!normalizepost.load('Config.xml')) {
        CONSOLE.error(normalizepost.getLastError());
        return false;
    }
}

The syntax is:

moduleVariable.getLastError()

where moduleVariable is the variable corresponding to the module and set with require().

close

The close method is used to free up the resources allocated by the normalizepost module object.
It's not mandatory to invoke this method, but if you decide to do it, invoke it inside the shutdown function.

For example:

function shutdown() {
    normalizepost.close();
}

The syntax is:

moduleVariable.close()

where moduleVariable is the variable corresponding to the module and set with require().