normalizepost

Overview

normalizepost is a scripting module providing transformation and normalization features for extraction records.

The available methods for this module are:

FIELD_NORM_STRING
FIELD_NORM_NUMERIC
RENEXTRA
REPLACEFIELD
SPLIT_NORM_LIST_REPLACE
SPLIT_NORM_LIST_REPLACE_REGEX
load
apply
getLastError
close

When you install the normalizepost module in your Studio project, Studio modifies the main.jr file by inserting this statement at the beginning of the file:

var normalizepost = require("modules/normalizepost");

The statement above sets a variable with an instance of the module so that you can use it inside event handling functions.

All methods except for load, getLastError and close must be used in the onFinalize function, because they act on the analysis results available when this function is run.

The load method must be used in the initialize function, because it is the right place for the initialization of objects needed in other event handling functions.

The getLastError method must be used in the initialize function, because it retrieves the message corresponding to the last error that occurred when the load method fails.

The close method must be used in the shutdown function, because it is used to free up the resources allocated by the normalizepost module object.
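The placement rules above can be summarized in a minimal main.jr skeleton. This is only a sketch: the normalizepost object below is a stand-in that records calls so that the snippet runs outside Studio, and the template and field names are borrowed from the examples later in this page. In a real project the variable is set by the require() statement shown earlier.

```javascript
// Stand-in for the real module so that this sketch runs outside Studio.
// In main.jr the variable is set with:
//     var normalizepost = require("modules/normalizepost");
var callLog = [];
var normalizepost = {
    FIELD_NORM_STRING: function (result, args) { callLog.push("FIELD_NORM_STRING"); },
    close: function () { callLog.push("close"); }
};

function initialize(cmdline) {
    // load and getLastError belong here (imported legacy projects only).
    return true;
}

function onFinalize(result) {
    // Normalization methods act on the analysis results, so they go here.
    normalizepost.FIELD_NORM_STRING(result, {
        templateName: "PERSONAL_DATA",
        fieldName: "NAME",
        normalizationType: "CASE",
        normalizationRules: "TCASE"
    });
    return result;
}

function shutdown() {
    // Optional: free the resources allocated by the module.
    normalizepost.close();
}
```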

FIELD_NORM_STRING

Purpose and syntax

FIELD_NORM_STRING normalizes field values in various ways.

The main syntax is:

moduleVariable.FIELD_NORM_STRING(result, arguments)

where:

moduleVariable is the variable corresponding to the module and set with require().
result is the object containing the analysis results.
arguments is an object containing the parameters to be used. Such parameters are:
- templateName is the template name of the records to act upon. It can be an asterisk (*), which means "any template".
- fieldName is the field name.
- normalizationType is the normalization type. It's a string that can be:
  - CASE
  - BOOL
  - REPLACE
  - REPLACE_REGEX
  - REPLACE_REGEX_DEBUG_MODE
  Normalization types are described below.
- normalizationRules is the specification of the normalization and its possible values vary based on the value of normalizationType (see below).
- fallbackValue (optional, valid only for REPLACE, REPLACE_REGEX and REPLACE_REGEX_DEBUG_MODE):
  - if not declared, non-normalized values are left in the output as they are.
  - if declared, its value replaces any field value to which no normalization could be applied.
  - if an empty string is specified, non-normalized values are deleted from the output.
- debugFlag (optional, valid only for REPLACE_REGEX): if set to true, debug information is printed in the console during regex replacements (see REPLACE_REGEX_DEBUG_MODE for more information).

The method also supports a parametric syntax, that is:

moduleVariable.FIELD_NORM_STRING(result, templateName, fieldName, normalizationType, normalizationRules[, fallbackValue, debugFlag])

Normalization types and formats

CASE

CASE normalization applies a specific letter case to field values.
Parameter normalizationRules must be a string with one of these values:

UCASE: all words are upper-cased
LCASE: all words are lower-cased
TCASE: simplified title case, all words are capitalized

For example, this code:

function onFinalize(result) {
    normalizepost.FIELD_NORM_STRING(result, {
        templateName: "PERSONAL_DATA",
        fieldName: "NAME",
        normalizationType: "CASE",
        normalizationRules: "TCASE"
    });
    return result;
}

capitalizes all the words in the value of each occurrence of field NAME in all the PERSONAL_DATA template records.
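The effect of the three CASE rules can be approximated in plain JavaScript. The sketch below is not the module's implementation: applyCase is a hypothetical helper, and it assumes that TCASE lower-cases the rest of each word and that words are separated by whitespace.

```javascript
// Hypothetical helper that mimics the UCASE, LCASE and TCASE rules on one value.
function applyCase(value, rule) {
    switch (rule) {
        case "UCASE":
            return value.toUpperCase();
        case "LCASE":
            return value.toLowerCase();
        case "TCASE":
            // Assumption: lower-case everything, then capitalize the first
            // letter of each whitespace-separated word.
            return value.toLowerCase().replace(/(^|\s)(\S)/g, function (match, separator, first) {
                return separator + first.toUpperCase();
            });
        default:
            // Unknown rule: leave the value untouched.
            return value;
    }
}

var titled = applyCase("john fitzgerald KENNEDY", "TCASE");
// titled is "John Fitzgerald Kennedy"
```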

BOOL

BOOL normalization replaces values like yes and no with alternative text.
If the alternative text is empty, the field is deleted. If, as a consequence of the replacement with empty strings, all of a record's fields are deleted, the entire record is deleted too.
Parameter normalizationRules must be a string with this syntax:

alternativeForYes\alternativeForNo

where alternativeForYes is the replacement for "yes" values and alternativeForNo is the replacement for "no" values.

For example, this code:

function onFinalize(result) {
    normalizepost.FIELD_NORM_STRING(result, {
        templateName: "*",
        fieldName: "Answer",
        normalizationType: "BOOL",
        normalizationRules: "TRUE\\FALSE"
    });
    return result;
}

changes all "yes" values to TRUE and all "no" values to FALSE for all the instances of field Answer occurring in any record of any template.

The method recognizes "yes" and "no" values written in English, Spanish, French, German and Italian, regardless of the language of the analysis project.
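Note that the rule in the example is written as "TRUE\\FALSE" because, inside a JavaScript string literal, a backslash must be escaped: the resulting string contains a single backslash. The following sketch illustrates this and shows one way such a rule string could be split into its two parts. parseBoolRule is a hypothetical helper, not part of the module.

```javascript
// "\\" in JavaScript source denotes one backslash character.
var rule = "TRUE\\FALSE";

// Hypothetical helper: splits a BOOL rule at the first backslash.
function parseBoolRule(ruleString) {
    var separatorIndex = ruleString.indexOf("\\");
    if (separatorIndex === -1) {
        return null; // not in the alternativeForYes\alternativeForNo form
    }
    return {
        yes: ruleString.slice(0, separatorIndex),
        no: ruleString.slice(separatorIndex + 1)
    };
}

var parsed = parseBoolRule(rule);
// rule.length is 10: "TRUE" (4) + one backslash (1) + "FALSE" (5)
// parsed.yes is "TRUE", parsed.no is "FALSE"
```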

REPLACE

REPLACE normalization replaces a given field value with another.

If the replacement text is empty, the field is deleted. If, as a consequence of replacement with empty strings, all of a record's fields are deleted, the entire record is deleted too.

Parameter normalizationRules can be:

An object with this structure:
```
{
    valueToReplace: newValue
}
```
where valueToReplace is the value to replace and newValue is the replacement value. valueToReplace must be written in lower case; the match against the fields' values is case insensitive.

Or:

The name of a variable set with SPLIT_NORM_LIST_REPLACE.

For example, this code:

function onFinalize(result) {
    normalizepost.FIELD_NORM_STRING(result, {
        templateName: "PERSONAL_DATA",
        fieldName: "Job",
        normalizationType: "REPLACE",
        normalizationRules: {"software engineer": "programmer"}
    });
    return result;
}

replaces with programmer all the occurrences of software engineer, regardless of the letter case, for the Job field in all the PERSONAL_DATA template records.

By default, if a value cannot be normalized, it remains unchanged. However, the user can specify a default value to use in the event that normalization fails. This default value must be specified inside the arguments object within a key named fallbackValue. For instance:

function onFinalize(result) {
    normalizepost.FIELD_NORM_STRING(result, {
        templateName: "PERSONAL_DATA",
        fieldName: "Job",
        normalizationType: "REPLACE",
        normalizationRules: {"software engineer": "programmer"},
        fallbackValue: "Other job"
    });
    return result;
}

In the example shown, any value extracted that is not software engineer will be mapped to Other job.

Note

It is also possible to declare an empty string as the fall-back value to remove fields that could not be normalized.
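The interplay between the rules object, the case-insensitive match and fallbackValue can be sketched as follows. This is an illustration of the behavior described above, not the module's code: replaceValue is a hypothetical helper and null is used here to stand for "the field is deleted".

```javascript
// Hypothetical helper mimicking REPLACE on a single field value.
// Returns the new value, or null when the field would be deleted.
function replaceValue(value, rules, fallbackValue) {
    var key = value.toLowerCase(); // rule keys are lower case, match is case insensitive
    var result;
    if (Object.prototype.hasOwnProperty.call(rules, key)) {
        result = rules[key];
    } else if (fallbackValue === undefined) {
        return value; // no fallback declared: the value is left as it is
    } else {
        result = fallbackValue;
    }
    return result === "" ? null : result; // empty string means deletion
}

var jobRules = { "software engineer": "programmer" };
var a = replaceValue("Software Engineer", jobRules);       // "programmer"
var b = replaceValue("plumber", jobRules);                 // "plumber"
var c = replaceValue("plumber", jobRules, "Other job");    // "Other job"
var d = replaceValue("plumber", jobRules, "");             // null (field deleted)
```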

REPLACE_REGEX

REPLACE_REGEX normalization replaces any occurrence of a given JavaScript regular expression with an alternative text, which can contain references to capturing groups like $1, $2, etc.

If the replacement text is empty, the field is deleted. If, as a consequence of replacement with empty strings, all of a record's fields are deleted, the entire record is deleted too.

Parameter normalizationRules can be:

An array of objects, each with this structure:
```
{
    regexp: regularExpression,
    value: replacementText
}
```
where regularExpression is the JavaScript regular expression used to find the text to replace and replacementText is the replacement text.

Or:

The name of a variable set with SPLIT_NORM_LIST_REPLACE_REGEX

For example, this code:

function onFinalize(result) {
    normalizepost.FIELD_NORM_STRING(result, {
        templateName: "CRIMINAL",
        fieldName: "FaceFeature",
        normalizationType: "REPLACE_REGEX",
        normalizationRules: [{"regexp": /(?:great|big|large) (.*)/, "value": "$1: large"}, {"regexp": /(?:small|little|tiny) (.*)/, "value": "$1: small"}]
    });
    return result;
}

replaces expressions like big nose or small ears with nose: large and ears: small in the FaceFeature field values of CRIMINAL template records.

By default, if a value cannot be matched by any regex, it remains unchanged. However, the user can specify a default value to use if no regex matches. This default value must be specified inside the arguments object within a key named fallbackValue. For instance:

function onFinalize(result) {
    normalizepost.FIELD_NORM_STRING(result, {
        templateName: "CRIMINAL",
        fieldName: "FaceFeature",
        normalizationType: "REPLACE_REGEX",
        normalizationRules: [{"regexp": /(?:great|big|large) (.*)/, "value": "$1: large"}, {"regexp": /(?:small|little|tiny) (.*)/, "value": "$1: small"}],
        fallbackValue: "Unknown size"
    });
    return result;
}

In the example, any value not altered by a regex will be replaced with Unknown size.

Note

It is also possible to declare an empty string as the fall-back value to remove fields that were not altered by any regex.
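The way capturing groups and fallbackValue combine can be tried out in plain JavaScript. The sketch below relies only on the standard String.prototype.replace method. replaceRegex is a hypothetical helper, and the assumption that the first matching rule wins is ours, not something stated by the module documentation.

```javascript
// Hypothetical helper mimicking REPLACE_REGEX on a single field value.
// Returns the new value, or null when the field would be deleted.
function replaceRegex(value, rules, fallbackValue) {
    for (var i = 0; i < rules.length; i++) {
        // Rules are expected without the "g" flag, so test() keeps no state.
        if (rules[i].regexp.test(value)) {
            var replaced = value.replace(rules[i].regexp, rules[i].value);
            return replaced === "" ? null : replaced;
        }
    }
    if (fallbackValue === undefined) {
        return value; // no fallback declared: the value is left as it is
    }
    return fallbackValue === "" ? null : fallbackValue;
}

var sizeRules = [
    { regexp: /(?:great|big|large) (.*)/, value: "$1: large" },
    { regexp: /(?:small|little|tiny) (.*)/, value: "$1: small" }
];

var r1 = replaceRegex("big nose", sizeRules);               // "nose: large"
var r2 = replaceRegex("small ears", sizeRules);             // "ears: small"
var r3 = replaceRegex("scar", sizeRules);                   // "scar"
var r4 = replaceRegex("scar", sizeRules, "Unknown size");   // "Unknown size"
var r5 = replaceRegex("scar", sizeRules, "");               // null (field deleted)
```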

REPLACE_REGEX_DEBUG_MODE

REPLACE_REGEX_DEBUG_MODE acts like REPLACE_REGEX but it also activates a debug mode to check if your regular expressions are working.

This code:

function onFinalize(result) {
    normalizepost.FIELD_NORM_STRING(result, {
        templateName: "CRIMINAL",
        fieldName: "FaceFeature",
        normalizationType: "REPLACE_REGEX_DEBUG_MODE",
        normalizationRules: [{"regexp": /(?:great|big|large) (.*)/, "value": "$1: large"}, {"regexp": /(?:small|little|tiny) (.*)/, "value": "$1: small"}]
    });
    return result;
}

replaces expressions like big nose or small ears with nose: large and ears: small in the FaceFeature field values of CRIMINAL template records. You can see notifications about your regular expressions from the Console tool window:

input string "big nose" modified as "nose: large" by the regex at index 0.

Note

When utilizing the SPLIT_NORM_LIST_REPLACE_REGEX method to load an external list, the debug message will provide more context by displaying the name of the list and the line number at which the replacement was triggered, rather than just the plain index.

You can also activate this mode using REPLACE_REGEX but adding a debugFlag property, like this:

function onFinalize(result) {
    normalizepost.FIELD_NORM_STRING(result, {
        templateName: "CRIMINAL",
        fieldName: "FaceFeature",
        normalizationType: "REPLACE_REGEX",
        normalizationRules: [{"regexp": /(?:great|big|large) (.*)/, "value": "$1: large"}, {"regexp": /(?:small|little|tiny) (.*)/, "value": "$1: small"}],
        fallbackValue: "No normalization applied",
        debugFlag: true
    });
    return result;
}

This property can be:

true (boolean) or debug mode: reports all the strings transformed by the regular expressions.
verbose debug mode: reports all the strings, both those transformed and those left untransformed by the regular expressions.
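To get a feel for what the debug notifications tell you, the following sketch reproduces the message format shown above for an inline array of rules. It is only an approximation: replaceRegexWithDebug and the log callback are hypothetical, and the real module writes to the Console tool window.

```javascript
// Hypothetical helper: applies the first rule that changes the value and
// reports which rule index was responsible.
function replaceRegexWithDebug(value, rules, log) {
    for (var i = 0; i < rules.length; i++) {
        var replaced = value.replace(rules[i].regexp, rules[i].value);
        if (replaced !== value) {
            log('input string "' + value + '" modified as "' + replaced +
                '" by the regex at index ' + i + '.');
            return replaced;
        }
    }
    return value; // nothing to report in non-verbose mode
}

var debugMessages = [];
var debugRules = [
    { regexp: /(?:great|big|large) (.*)/, value: "$1: large" },
    { regexp: /(?:small|little|tiny) (.*)/, value: "$1: small" }
];
replaceRegexWithDebug("small ears", debugRules, function (message) {
    debugMessages.push(message);
});
// debugMessages[0] is:
// input string "small ears" modified as "ears: small" by the regex at index 1.
```

The index in the message is what lets you spot a rule that fires when you did not expect it to.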

FIELD_NORM_NUMERIC

The FIELD_NORM_NUMERIC method converts numbers written in words into numbers expressed in digits, possibly applying a multiplication factor to the numbers obtained.

Consider for example this template:

TEMPLATE(DISTANCE)
{
    @KILOMETERS,
    @METERS
}

If the following rule:

SCOPE SENTENCE
{
    IDENTIFY(DISTANCE)
    {
        LEMMA("distance")
        <1:4>
        KEYWORD("a")
        <1:4>
        KEYWORD("b")
        <1:3>
        LEMMA("be")
        <1:3>
        @METERS[KEYWORD("five thousand")]
    }
}

is applied to this input text

The distance from A to B is five thousand m.

with this code (see also the FIELD_CLONE method of mergepost):

function onFinalize(result) {
    mergepost.FIELD_CLONE(result, {
        templateName: "DISTANCE",
        fieldName: "METERS",
        clonedFieldName: "KILOMETERS"
    })
    return result
}

you get this record:

Template: DISTANCE

Field	Value
@METERS	five thousand
@KILOMETERS	five thousand

If you change the code like this:

function onFinalize(result) {
    mergepost.FIELD_CLONE(result, {
        templateName: "DISTANCE",
        fieldName: "METERS",
        clonedFieldName: "KILOMETERS"
    })

    normalizepost.FIELD_NORM_NUMERIC(result, {
        lang: "EN",
        templateName: "DISTANCE",
        fieldName: "KILOMETERS",
        adapt: "*m"
    })
    return result
}

and apply the rule above to the same input text, you get:

Template: DISTANCE

Field	Value
@METERS	five thousand
@KILOMETERS	5

The number in words has been converted to digits, then the multiplication factor *m has been applied, where "m" stands for "milli" and corresponds to x 10^-3.

If the field value is already a number expressed with digits, no conversion takes place, but the number is recognized as such and the possible multiplication factor is applied.
So, in the case of the example above, if the initial value of field KILOMETERS had been 5000, it would have become 5 all the same.

The syntax is:

moduleVariable.FIELD_NORM_NUMERIC(result, arguments)

where:

moduleVariable is the variable corresponding to the module and set with require().
result is the object containing the analysis results.
arguments is an object containing the parameters to be used. Such parameters are:
- lang refers to the language in which numbers in words are written. Possible values are:
  - EN for English
  - IT for Italian
  - ES for Spanish
  - DE for German
  - NL for Dutch
- templateName is the template name of the records to act upon. It can be an asterisk (*), which means "any template".
- fieldName is the field name.
- adapt is a multiplication factor. It can be an empty string, in which case the numeric value is not altered, or it can be one of the following:

Value	Multiplication factor
*p (pico)	x 10^-12
*n (nano)	x 10^-9
*u (micro)	x 10^-6
*m (milli)	x 10^-3
*K (kilo)	x 10^3
*M (mega)	x 10^6
*G (giga)	x 10^9
*T (tera)	x 10^12

The method also supports a parametric syntax, that is:

moduleVariable.FIELD_NORM_NUMERIC(result, lang, templateName, fieldName, adapt)
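The two steps performed by FIELD_NORM_NUMERIC, conversion to digits and application of the adapt factor, can be illustrated with a small stand-alone sketch. It is not the module's implementation: wordsToNumber and applyAdapt are hypothetical helpers, and the word parser is deliberately simplified (English only, whole numbers below one billion).

```javascript
var SMALL_NUMBERS = {
    zero: 0, one: 1, two: 2, three: 3, four: 4, five: 5, six: 6, seven: 7,
    eight: 8, nine: 9, ten: 10, eleven: 11, twelve: 12, thirteen: 13,
    fourteen: 14, fifteen: 15, sixteen: 16, seventeen: 17, eighteen: 18,
    nineteen: 19, twenty: 20, thirty: 30, forty: 40, fifty: 50, sixty: 60,
    seventy: 70, eighty: 80, ninety: 90
};
var SCALES = { thousand: 1000, million: 1000000 };
// Exponents of ten for the adapt values documented above.
var ADAPT_EXPONENTS = {
    "*p": -12, "*n": -9, "*u": -6, "*m": -3, "*K": 3, "*M": 6, "*G": 9, "*T": 12
};

function has(object, key) {
    return Object.prototype.hasOwnProperty.call(object, key);
}

// Converts English number words (or a digit string) to a number, null if unknown.
function wordsToNumber(text) {
    if (/^\d+(\.\d+)?$/.test(text)) {
        return Number(text); // already expressed in digits
    }
    var total = 0;
    var current = 0;
    var words = text.toLowerCase().split(/[\s-]+/);
    for (var i = 0; i < words.length; i++) {
        var word = words[i];
        if (word === "and") {
            continue;
        } else if (has(SMALL_NUMBERS, word)) {
            current += SMALL_NUMBERS[word];
        } else if (word === "hundred") {
            current *= 100;
        } else if (has(SCALES, word)) {
            total += current * SCALES[word];
            current = 0;
        } else {
            return null; // unrecognized word
        }
    }
    return total + current;
}

// Applies an adapt factor; an empty string leaves the number unchanged.
function applyAdapt(number, adapt) {
    if (adapt === "") {
        return number;
    }
    if (!has(ADAPT_EXPONENTS, adapt)) {
        return null; // unknown factor
    }
    var exponent = ADAPT_EXPONENTS[adapt];
    // Dividing by a positive power of ten limits rounding noise for negative exponents.
    return exponent < 0
        ? number / Math.pow(10, -exponent)
        : number * Math.pow(10, exponent);
}

var kilometers = applyAdapt(wordsToNumber("five thousand"), "*m"); // 5
```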

RENEXTRA

The RENEXTRA method renames records' templates and fields.

Consider for example this template:

TEMPLATE(PERSONAL_DATA)
{
    @NAME,
    @ADDRESS,
    @JOB,
    @AGE
}

If the following rule:

SCOPE SENTENCE
{
    IDENTIFY(PERSONAL_DATA)
    {
        @NAME[TYPE(NPH)]
        <>
        @AGE[PATTERN("[1-9][0-9]")]
        <>
        @JOB[LEMMA("technical writer")]
    }
}

is applied to this input text:

Christine is 42 years old and works as a technical writer.

you get this record:

Template: PERSONAL_DATA

Field	Value
@NAME	Christine
@JOB	technical writer
@AGE	42

With this code:

function onFinalize(result) {
    normalizepost.RENEXTRA(result, {
        templateName: "PERSONAL_DATA",
        newTemplateName: "PERSONAL_INFORMATION",
        renameRules: null
    })
    return result;
}

if the rule above is applied to the same input text, you get this record:

Template: PERSONAL_INFORMATION

Field	Value
@NAME	Christine
@JOB	technical writer
@AGE	42

The template name for PERSONAL_DATA record has changed to PERSONAL_INFORMATION.

If the code was:

function onFinalize(result) {
    normalizepost.RENEXTRA(result, {
        templateName: "PERSONAL_DATA",
        newTemplateName: null,
        renameRules: [{name: "NAME", new: "PROPER_NAME"}, {name: "AGE", new: "YEARS_OF_AGE"}]
    })
    return result;
}

the output would be:

Template: PERSONAL_DATA

Field	Value
@YEARS_OF_AGE	AGE
@PROPER_NAME	NAME
@JOB	technical writer

In this case, fields have been renamed and the old names have become the values of the fields themselves.

Note

Odd as it may seem, this behavior is by design. In fact, this method replicates the behavior of a post-processor that was available in the legacy technology of which Studio represents the evolution, and it is meant to be used for backward compatibility when importing old projects. For an alternative way to rename records' templates and fields, consider the JsonPlug module.

RENEXTRA is typically used in combination with REPLACEFIELD (see below).

The syntax is:

moduleVariable.RENEXTRA(result, arguments)

where:

moduleVariable is the variable corresponding to the module and set with require().
result is the object containing the analysis results.
arguments is an object containing the parameters to be used. Such parameters are:
- templateName is the template name of the records to act upon. It can be an asterisk (*), which means "any template".
- newTemplateName is the new template name that will replace the old one. If null, the template name is not changed, so use null if you only want to rename fields.
- renameRules is an array containing objects each representing a field rename rule and having the following properties:
  - name: old field name.
  - new: new field name.
  If null, fields are not renamed, so use null if you only want to change the template name.

The method also supports a purely parametric syntax, that is:

moduleVariable.RENEXTRA(result, templateName, newTemplateName, renameRules)

REPLACEFIELD

The REPLACEFIELD method changes all the matches of a regular expression inside the values of all the fields that have been renamed using the RENEXTRA method.

Consider this template:

TEMPLATE(PERSONAL_DATA)
{
    @Name,
    @Job,
    @Product,
    @Company,
    @Role
}

If the following rule:

SCOPE SENTENCE
{
    IDENTIFY(PERSONAL_DATA)
    {
        @Name[TYPE(NPH)]
        <1:5>
        @Job[LEMMA("software engineer")]
        <1:4>
        @Company[TYPE(COM)]
    }
}

is applied to this input text:

George Dickinson is a software engineer for Acme Ltd.

you will normally get this record:

Template: PERSONAL_DATA

Field	Value
@Name	George Dickinson
@Job	software engineer
@Company	Acme Ltd.

With this code:

function onFinalize(result) {
    var renameRules = [{name: "Name", new: "PseudoName"}]
    normalizepost.RENEXTRA(result, {
        templateName: "PERSONAL_DATA",
        newTemplateName: null,
        renameRules: renameRules
    })
    normalizepost.REPLACEFIELD(result, {
        regularExpression: "Name",
        replaceValue: "John Doe"
    })
    return result;
}

the record becomes:

Template: PERSONAL_DATA

Field	Value
@PseudoName	John Doe
@Job	software engineer
@Company	Acme Ltd.
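The result is easier to follow when the two steps are considered separately: RENEXTRA renames the field and sets its value to the old name (Name), then REPLACEFIELD replaces that value wherever the regular expression matches. The sketch below reproduces this sequence on a simplified, hypothetical record structure; the real result object is different, and renameFields and replaceInRenamedFields are illustrative helpers only.

```javascript
// Hypothetical, simplified record: not the structure of the real result object.
var record = {
    template: "PERSONAL_DATA",
    fields: [
        { name: "Name", value: "George Dickinson" },
        { name: "Job", value: "software engineer" },
        { name: "Company", value: "Acme Ltd." }
    ]
};

// Step 1 (RENEXTRA-like): rename fields; the old name becomes the value.
function renameFields(record, renameRules) {
    record.fields.forEach(function (field) {
        for (var i = 0; i < renameRules.length; i++) {
            if (field.name === renameRules[i].name) {
                field.value = renameRules[i].name;
                field.name = renameRules[i]["new"];
                field.renamed = true; // remember which fields were renamed
                break; // apply at most one rule per field
            }
        }
    });
}

// Step 2 (REPLACEFIELD-like): replace regex matches in renamed fields only.
function replaceInRenamedFields(record, regularExpression, replaceValue) {
    var regexp = new RegExp(regularExpression, "g");
    record.fields.forEach(function (field) {
        if (field.renamed) {
            field.value = field.value.replace(regexp, replaceValue);
        }
    });
}

renameFields(record, [{ name: "Name", new: "PseudoName" }]);
// After step 1 the first field is { name: "PseudoName", value: "Name" }
replaceInRenamedFields(record, "Name", "John Doe");
// After step 2 the first field is { name: "PseudoName", value: "John Doe" }
```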

The syntax is:

moduleVariable.REPLACEFIELD(result, arguments)

where:

moduleVariable is the variable corresponding to the module and set with require().
result is the object containing the analysis results.
arguments is an object containing the parameters to be used. Such parameters are:
- regularExpression is a string containing a regular expression used to find the parts of fields' value to change.
- replaceValue is the string that replaces all the matches of regularExpression in fields' values.

The method also supports a purely parametric syntax, that is:

moduleVariable.REPLACEFIELD(result, regularExpression, replaceValue)

SPLIT_NORM_LIST_REPLACE

Use the SPLIT_NORM_LIST_REPLACE method to build an array of replacement objects from a list file. This array can be used as the value of the normalizationRules parameter of the FIELD_NORM_STRING method when the value of its normalizationType parameter is REPLACE.

For example, consider this template:

TEMPLATE(PERSONAL_DATA)
{
    @Name,
    @Address,
    @Phone,
    @Job
}

If this rule:

SCOPE SENTENCE
{
    IDENTIFY(PERSONAL_DATA)
    {
        @Name[TYPE(NPH)]
        <>
        @Job[LEMMA("software engineer")]
    }
}

is applied to this text:

John is a software engineer.

you will get:

Template: PERSONAL_DATA

Field	Value
@Name	John
@Job	software engineer

With this code:

function onFinalize(result) {
    var replacementList = normalizepost.SPLIT_NORM_LIST_REPLACE("replacements.cl")
    normalizepost.FIELD_NORM_STRING(result, "PERSONAL_DATA", "Job", "REPLACE", replacementList);
    return result;
}

and if the replacements.cl list file has these contents:

programmer=software engineer|developer|dev

you will get:

Template: PERSONAL_DATA

Field	Value
@Name	John
@Job	programmer

The syntax of SPLIT_NORM_LIST_REPLACE is:

moduleVariable.SPLIT_NORM_LIST_REPLACE(listFileName)

where:

moduleVariable is the variable corresponding to the module and set with require().
listFileName is the list file name. List files should be placed in a folder called normalizepost, which in turn must be located either in the rules or in the modules folder of the project.

Note

If both the rules and the modules folders contain a normalizepost subfolder with the same list file, the one under the rules folder takes precedence.
If the list is not found within the normalizepost folder, the method treats listFileName as a path relative to the rules folder and attempts to load it from there.

Each line of the list file must have this syntax:

newValue=valueToReplace_1[|valueToReplace_2...|valueToReplace_n]

where:

newValue is the new value.
valueToReplace_n is a value to replace.

If newValue is empty, the field is deleted. If, as a consequence of replacement with empty strings, all of a record's fields are deleted, the entire record is deleted too.
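The line syntax can be made concrete with a small parser. This is a sketch of how such a file could be read, not the module's code: parseReplacementList is a hypothetical helper, and the shape of its return value (a lower-case valueToReplace to newValue mapping, like the inline REPLACE rules object) is an assumption.

```javascript
// Hypothetical parser for lines of the form
//     newValue=valueToReplace_1|valueToReplace_2|...
function parseReplacementList(text) {
    var rules = {};
    text.split(/\r?\n/).forEach(function (line) {
        var separatorIndex = line.indexOf("=");
        if (separatorIndex === -1) {
            return; // skip blank or malformed lines
        }
        var newValue = line.slice(0, separatorIndex);
        line.slice(separatorIndex + 1).split("|").forEach(function (valueToReplace) {
            if (valueToReplace !== "") {
                // Keys are stored in lower case for case-insensitive matching.
                rules[valueToReplace.toLowerCase()] = newValue;
            }
        });
    });
    return rules;
}

var listRules = parseReplacementList("programmer=software engineer|developer|dev");
// listRules is:
// { "software engineer": "programmer", "developer": "programmer", "dev": "programmer" }
```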

SPLIT_NORM_LIST_REPLACE_REGEX

SPLIT_NORM_LIST_REPLACE_REGEX works like SPLIT_NORM_LIST_REPLACE, with the difference that list files contain JavaScript regular expressions. The resulting array can be used as the value of the normalizationRules parameter of the FIELD_NORM_STRING method when the value of its normalizationType parameter is REPLACE_REGEX.

For example, consider this template:

TEMPLATE(PERSONAL_DATA)
{
    @Name,
    @Address,
    @Phone,
    @Job
}

If this rule:

SCOPE SENTENCE
{
    IDENTIFY(PERSONAL_DATA)
    {
        @Name[TYPE(NPH)]
        <>
        @Job[LEMMA("software developer")]
    }
}

is applied to this text:

John is a software developer.

you will get this record:

Template: PERSONAL_DATA

Field	Value
@Name	John
@Job	software developer

With this code:

function onFinalize(result) {
    var replacementList = normalizepost.SPLIT_NORM_LIST_REPLACE_REGEX("replacements.cl")
    normalizepost.FIELD_NORM_STRING(result, "PERSONAL_DATA", "Job", "REPLACE_REGEX", replacementList);
    return result;
}

and if the replacements.cl list file has these contents:

$2 programmer=\b((software) developer)\b/i

you will get this record:

Template: PERSONAL_DATA

Field	Value
@Name	John
@Job	software programmer

The syntax of SPLIT_NORM_LIST_REPLACE_REGEX is:

moduleVariable.SPLIT_NORM_LIST_REPLACE_REGEX(listFileName)

where:

moduleVariable is the variable corresponding to the module and set with require().
listFileName is the list file name. List files should be placed in a folder called normalizepost, which in turn must be located either in the rules or in the modules folder of the project.

Note

If both the rules and the modules folders contain a normalizepost subfolder with the same list file, the one under the rules folder takes precedence.
If the list is not found within the normalizepost folder, the method treats listFileName as a path relative to the rules folder and attempts to load it from there.

Each line of the list file must have this syntax:

replacementString=regularExpression

where:

replacementString is the replacement string, which can contain references to capturing groups like $1, $2, etc.
regularExpression is the JavaScript regular expression.

If replacementString is empty, the field is deleted. If, as a consequence of replacement with empty strings, all of a record's fields are deleted, the entire record is deleted too.
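To see how a line such as the one in the example turns into a rule, consider the following sketch. It is based on assumptions inferred from that single example: the line is split at the first equals sign, and an optional trailing /flags suffix (such as /i) carries the regular expression flags. parseRegexList is a hypothetical helper and the module's actual parsing rules may differ.

```javascript
// Hypothetical parser for lines of the form
//     replacementString=regularExpression[/flags]
function parseRegexList(text) {
    var rules = [];
    text.split(/\r?\n/).forEach(function (line) {
        var separatorIndex = line.indexOf("=");
        if (separatorIndex === -1) {
            return; // skip blank or malformed lines
        }
        var replacement = line.slice(0, separatorIndex);
        var expression = line.slice(separatorIndex + 1);
        // Assumption: a trailing "/letters" suffix holds the regex flags.
        var flags = "";
        var flagsMatch = /\/([a-z]*)$/.exec(expression);
        if (flagsMatch) {
            flags = flagsMatch[1];
            expression = expression.slice(0, flagsMatch.index);
        }
        rules.push({ regexp: new RegExp(expression, flags), value: replacement });
    });
    return rules;
}

// The JavaScript literal below denotes the line: $2 programmer=\b((software) developer)\b/i
var regexRules = parseRegexList("$2 programmer=\\b((software) developer)\\b/i");
var normalizedJob = "software developer".replace(regexRules[0].regexp, regexRules[0].value);
// normalizedJob is "software programmer": $2 refers to the inner group (software)
```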

load

The load method prepares one or more of the operations that can be attained with the methods above, but using as its source a configuration file generated when importing a project created with a legacy edition of Studio. Prepared operations are then applied using the apply method.

Warning

The use of the load method is not required in cases other than that indicated above and the import procedure already generates the appropriate statements inside the main.jr file, so there are basically no cases in which you have to write code that uses this method.

For example, when importing an old project, Studio may generate this code:

function initialize(cmdline) {
    if (!normalizepost.load('Config.xml')) {
        CONSOLE.error(normalizepost.getLastError());
        return false;
    }
    return true;
}

function onFinalize(result) {
    result = normalizepost.apply(result);
    return result;
}

Its syntax is:

moduleVariable.load(configPath)

where:

moduleVariable is the variable corresponding to the module and set with require().
configPath is the path of the configuration file generated by the import procedure (Config.xml in the example above).

The method returns true in case of success, false otherwise. In case of failure it sets an error message you can retrieve with the getLastError method.

apply

The apply method performs all the operations prepared with the invocation of the load method.

For example:

function onFinalize(result) {
    result = normalizepost.apply(result);
    return result;
}

The syntax is:

moduleVariable.apply(result)

where:

moduleVariable is the variable corresponding to the module and set with require().
result is the object containing the analysis results.

getLastError

The getLastError method retrieves the message corresponding to the last error that occurred when the load method fails. Use it to display the error message.

For example:

function initialize(cmdline) {
    if (!normalizepost.load('Config.xml')) {
        CONSOLE.error(normalizepost.getLastError());
        return false;
    }
    return true;
}

The syntax is:

moduleVariable.getLastError()

where moduleVariable is the variable corresponding to the module and set with require().

close

The close method is used to free up the resources allocated by the normalizepost module object.
It's not mandatory to invoke this method, but if you decide to do it, invoke it inside the shutdown function.

For example:

function shutdown() {
    normalizepost.close();
}

The syntax is:

moduleVariable.close()

where moduleVariable is the variable corresponding to the module and set with require().
