normalizepost
Overview
normalizepost is a scripting module providing transformation and normalization features for extraction records.
The available methods for this module are:
FIELD_NORM_STRING
FIELD_NORM_NUMERIC
RENEXTRA
REPLACEFIELD
SPLIT_NORM_LIST_REPLACE
SPLIT_NORM_LIST_REPLACE_REGEX
load
apply
getLastError
close
When you install the normalizepost module in your Studio project, Studio modifies the main.jr file to insert this statement at the beginning of the file:
var normalizepost = require("modules/normalizepost");
The statement above sets a variable with an instance of the module so that you can use it inside event handling functions.
All methods except for load, getLastError and close must be used in the onFinalize function, because they act on the analysis results available when this function is run.
The load method must be used in the initialize function, because it is the right place for the initialization of objects needed in other event handling functions.
The getLastError method must be used in the initialize function, because it retrieves the message corresponding to the last error that occurred when the load method fails.
The close method must be used in the shutdown function, because it is used to free up the resources allocated by the normalizepost module object.
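As a rough sketch of where each call belongs, the skeleton below lays out the three event handling functions named above. So that it can run outside Studio, a hypothetical stub object stands in for the module returned by require("modules/normalizepost"), and console.error stands in for Studio's console object; the stub's internals and the callLog array are illustrative assumptions, not the module's implementation.

```javascript
// Hypothetical stand-in for: var normalizepost = require("modules/normalizepost");
// It only records which method was called; the real module does the actual work.
var callLog = [];
var normalizepost = {
    load: function (configPath) { callLog.push("load"); return true; },
    getLastError: function () { callLog.push("getLastError"); return ""; },
    FIELD_NORM_STRING: function (result, args) { callLog.push("FIELD_NORM_STRING"); },
    close: function () { callLog.push("close"); }
};

// initialize: the place for load and, if load fails, getLastError.
function initialize(cmdline) {
    if (!normalizepost.load("Config.xml")) {
        console.error(normalizepost.getLastError());
        return false;
    }
    return true;
}

// onFinalize: the place for every method that acts on the analysis results.
function onFinalize(result) {
    normalizepost.FIELD_NORM_STRING(result, {
        templateName: "*",
        fieldName: "NAME",
        normalizationType: "CASE",
        normalizationRules: "UCASE"
    });
    return result;
}

// shutdown: the place for close.
function shutdown() {
    normalizepost.close();
}
```

In a project that was not imported from a legacy edition, the initialize function would normally not call load at all.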
FIELD_NORM_STRING
Purpose and syntax
FIELD_NORM_STRING normalizes field values in various ways.
The main syntax is:
moduleVariable.FIELD_NORM_STRING(result, arguments)
where:
- moduleVariable is the variable corresponding to the module and set with require().
- result is the object containing the analysis results.
- arguments is an object containing the parameters to be used. Such parameters are:
  - templateName is the template name of the records to act upon. It can be an asterisk (*), which means "any template".
  - fieldName is the field name.
  - normalizationType is the normalization type. It's a string that can be:
    - CASE
    - BOOL
    - REPLACE
    - REPLACE_REGEX
    - REPLACE_REGEX_DEBUG_MODE
    Normalization types are described below.
  - normalizationRules is the specification of the normalization and its possible values vary based on the value of normalizationType (see below).
  - fallbackValue (optional, valid only for REPLACE, REPLACE_REGEX and REPLACE_REGEX_DEBUG_MODE):
    - if not declared, non-normalized values are left in the output as they are.
    - if declared, its value acts as a replacement value in case no normalization could be applied.
    - if an empty string is specified, the non-normalized value is deleted from the output.
  - debugFlag (optional, valid only for REPLACE_REGEX): if set to true, it prints debug information in the console during regex replacements (see REPLACE_REGEX_DEBUG_MODE for more information).
The method also supports a parametric syntax, that is:
moduleVariable.FIELD_NORM_STRING(result, templateName, fieldName, normalizationType, normalizationRules[, fallbackValue, debugFlag])
Normalization types and formats
CASE
CASE normalization applies a specific letter case to field values.
Parameter normalizationRules must be a string with one of these values:
- UCASE: all words are upper-cased
- LCASE: all words are lower-cased
- TCASE: simplified title case, all words are capitalized
For example, this code:
function onFinalize(result) {
normalizepost.FIELD_NORM_STRING(result, {
templateName: "PERSONAL_DATA",
fieldName: "NAME",
normalizationType: "CASE",
normalizationRules: "TCASE"
});
return result;
}
capitalizes all the words in the value of each occurrence of field NAME in all the PERSONAL_DATA template records.
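To make the three CASE rules concrete, here is a minimal sketch that mimics them on a plain string. The function name applyCase is hypothetical, and two details are assumptions rather than documented behavior: words are taken to be separated by whitespace, and TCASE is taken to lower-case everything after the first letter of each word.

```javascript
// Illustrative only: mimics the three CASE rules on a plain string.
// Assumption: words are separated by whitespace, and TCASE lower-cases
// everything after the first letter of each word.
function applyCase(value, rule) {
    switch (rule) {
        case "UCASE":
            return value.toUpperCase();
        case "LCASE":
            return value.toLowerCase();
        case "TCASE":
            return value
                .split(/(\s+)/) // the capturing group keeps the separators
                .map(function (part) {
                    return part.charAt(0).toUpperCase() + part.slice(1).toLowerCase();
                })
                .join("");
        default:
            return value; // unknown rule: leave the value untouched
    }
}

console.log(applyCase("john ronald SMITH", "TCASE")); // John Ronald Smith
```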
BOOL
BOOL normalization replaces values like yes and no with alternative text.
If the alternative text is empty, the field is deleted. If, as consequence of the replacement with empty strings, all of a record's fields are deleted, the entire record is deleted too.
Parameter normalizationRules must be a string with this syntax:
alternativeForYes\alternativeForNo
where alternativeForYes is the replacement for "yes" values and alternativeForNo is the replacement for "no" values.
For example, this code:
function onFinalize(result) {
normalizepost.FIELD_NORM_STRING(result, {
templateName: "*",
fieldName: "Answer",
normalizationType: "BOOL",
normalizationRules: "TRUE\\FALSE"
});
return result;
}
changes all "yes" values to TRUE and all "no" values to FALSE for all the instances of field Answer occurring in any record of any template.
The method recognizes "yes" and "no" values written in English, Spanish, French, German and Italian, regardless of the language of the analysis project.
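The rule string and its backslash separator can be easy to misread, so the sketch below shows one way such a string could be interpreted. The function name applyBool is hypothetical, null is used here to represent a deleted field, and only English "yes" and "no" are handled; none of this is the module's own code.

```javascript
// Illustrative only: how a BOOL rule string could be interpreted.
// In JavaScript source "TRUE\\FALSE" is the 10-character string TRUE\FALSE,
// because a backslash inside a string literal must be escaped.
// Assumption: only English "yes"/"no" are recognized here, case insensitively.
function applyBool(value, rules) {
    var parts = rules.split("\\"); // split on a single backslash
    var alternativeForYes = parts[0];
    var alternativeForNo = parts.length > 1 ? parts[1] : "";
    var normalized = value.trim().toLowerCase();
    var replacement;
    if (normalized === "yes") {
        replacement = alternativeForYes;
    } else if (normalized === "no") {
        replacement = alternativeForNo;
    } else {
        return value; // not a yes/no value: left as it is (assumption)
    }
    // An empty alternative means the field is deleted, represented here by null.
    return replacement === "" ? null : replacement;
}

console.log(applyBool("Yes", "TRUE\\FALSE")); // TRUE
console.log(applyBool("no", "TRUE\\"));       // null
```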
REPLACE
REPLACE normalization replaces a given field value with another.
If the replacement text is empty, the field is deleted. If, as consequence of replacement with empty strings, all of a record's fields are deleted, the entire record is deleted too.
Parameter normalizationRules can be:
- An object with this structure:
{ valueToReplace: newValue }
where valueToReplace is the value to replace and newValue is the replacement value. valueToReplace must be in lower case; the match between it and the fields' values is case insensitive.
Or:
- The name of a variable set with SPLIT_NORM_LIST_REPLACE.
For example, this code:
function onFinalize(result) {
normalizepost.FIELD_NORM_STRING(result, {
templateName: "PERSONAL_DATA",
fieldName: "Job",
normalizationType: "REPLACE",
normalizationRules: {"software engineer": "programmer"}
});
return result;
}
replaces with programmer all the occurrences of software engineer, regardless of the letter case, for the Job field in all the PERSONAL_DATA template records.
By default, if a value cannot be normalized, it remains unchanged. However, the user can specify a default value to use in the event that normalization fails. This default value must be specified inside the arguments object within a key named fallbackValue. For instance:
function onFinalize(result) {
normalizepost.FIELD_NORM_STRING(result, {
templateName: "PERSONAL_DATA",
fieldName: "Job",
normalizationType: "REPLACE",
normalizationRules: {"software engineer": "programmer"},
fallbackValue: "Other job"
});
return result;
}
In the example shown, any value extracted that is not software engineer will be mapped to Other job.
Note
It is also possible to declare an empty string as the fall-back value to remove fields that could not be normalized.
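The three fallbackValue cases (not declared, declared, empty string) can be summarized with a small sketch. The function name applyReplace is hypothetical and null is used here to represent a field removed from the output; the code illustrates the documented rules and is not the module's implementation.

```javascript
// Illustrative only: the REPLACE lookup and the three fallbackValue cases.
// A return value of null stands for "field deleted from the output".
function applyReplace(value, rules, fallbackValue) {
    var key = value.toLowerCase(); // keys are lower case, the match is case insensitive
    if (Object.prototype.hasOwnProperty.call(rules, key)) {
        var newValue = rules[key];
        return newValue === "" ? null : newValue; // empty replacement deletes the field
    }
    if (fallbackValue === undefined) {
        return value;     // not declared: the value is left as it is
    }
    if (fallbackValue === "") {
        return null;      // empty string: the non-normalized value is deleted
    }
    return fallbackValue; // declared: used as the replacement
}

var rules = { "software engineer": "programmer" };
console.log(applyReplace("Software Engineer", rules));    // programmer
console.log(applyReplace("plumber", rules));              // plumber
console.log(applyReplace("plumber", rules, "Other job")); // Other job
console.log(applyReplace("plumber", rules, ""));          // null
```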
REPLACE_REGEX
REPLACE_REGEX normalization replaces any occurrence of a given JavaScript regular expression with an alternative text, which can contain references to capturing groups like $1, $2, etc.
If the replacement text is empty, the field is deleted. If, as consequence of replacement with empty strings, all of a record's fields are deleted, the entire record is deleted too.
Parameter normalizationRules can be:
- An array of objects, each with this structure:
{ regexp: regularExpression, value: replacementText }
where regularExpression is the JavaScript regular expression used to find the text to replace and replacementText is the replacement text.
Or:
- The name of a variable set with SPLIT_NORM_LIST_REPLACE_REGEX.
For example, this code:
function onFinalize(result) {
normalizepost.FIELD_NORM_STRING(result, {
templateName: "CRIMINAL",
fieldName: "FaceFeature",
normalizationType: "REPLACE_REGEX",
normalizationRules: [{"regexp": /(?:great|big|large) (.*)/, "value": "$1: large"}, {"regexp": /(?:small|little|tiny) (.*)/, "value": "$1: small"}]
});
return result;
}
replaces expressions like big nose or small ears with nose: large and ears: small in the FaceFeature field values of CRIMINAL template records.
By default, if a value cannot be matched by any regex, it remains unchanged. However, the user can specify a default value to use if no regex matches. This default value must be specified inside the arguments object within a key named fallbackValue (in the parametric syntax, it is the argument that follows normalizationRules). For instance:
function onFinalize(result) {
normalizepost.FIELD_NORM_STRING(result, {
templateName: "CRIMINAL",
fieldName: "FaceFeature",
normalizationType: "REPLACE_REGEX",
normalizationRules: [{"regexp": /(?:great|big|large) (.*)/, "value": "$1: large"}, {"regexp": /(?:small|little|tiny) (.*)/, "value": "$1: small"}],
fallbackValue: "Unknown size"
});
return result;
}
In the example, any value not altered by a regex will be replaced with Unknown size.
Note
It is also possible to declare an empty string as the fall-back value to remove fields that were not altered by any regex.
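The interplay between the rule array, capturing groups and the fall-back value can be sketched as follows. The function name applyReplaceRegex is hypothetical, null represents a deleted field, and the "first matching rule wins" order is an assumption, since the text above does not say whether later rules are still evaluated.

```javascript
// Illustrative only: applies an array of { regexp, value } rules to a string.
// Assumption: the first rule whose regular expression matches is applied
// and later rules are skipped. A return value of null stands for "field deleted".
function applyReplaceRegex(value, rules, fallbackValue) {
    for (var i = 0; i < rules.length; i++) {
        if (value.search(rules[i].regexp) !== -1) {
            if (rules[i].value === "") {
                return null; // empty replacement text deletes the field
            }
            // $1, $2, ... in the replacement text refer to capturing groups
            return value.replace(rules[i].regexp, rules[i].value);
        }
    }
    if (fallbackValue === undefined) {
        return value; // no fall-back declared: the value is left as it is
    }
    return fallbackValue === "" ? null : fallbackValue;
}

var sizeRules = [
    { regexp: /(?:great|big|large) (.*)/, value: "$1: large" },
    { regexp: /(?:small|little|tiny) (.*)/, value: "$1: small" }
];
console.log(applyReplaceRegex("big nose", sizeRules));                  // nose: large
console.log(applyReplaceRegex("small ears", sizeRules));                // ears: small
console.log(applyReplaceRegex("blue eyes", sizeRules, "Unknown size")); // Unknown size
```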
REPLACE_REGEX_DEBUG_MODE
REPLACE_REGEX_DEBUG_MODE acts like REPLACE_REGEX but it also activates a debug mode to check if your regular expressions are working.
This code:
function onFinalize(result) {
normalizepost.FIELD_NORM_STRING(result, {
templateName: "CRIMINAL",
fieldName: "FaceFeature",
normalizationType: "REPLACE_REGEX_DEBUG_MODE",
normalizationRules: [{"regexp": /(?:great|big|large) (.*)/, "value": "$1: large"}, {"regexp": /(?:small|little|tiny) (.*)/, "value": "$1: small"}]
});
return result;
}
replaces expressions like big nose or small ears with nose: large and ears: small in the FaceFeature field values of CRIMINAL template records. You can see notifications about your regular expressions from the Console tool window:
input string "big nose" modified as "nose: large" by the regex at index 0.
Note
When utilizing the SPLIT_NORM_LIST_REPLACE_REGEX method to load an external list, the debug message will provide more context by displaying the name of the list and the line number at which the replacement was triggered, rather than just the plain index.
You can also activate this mode by using REPLACE_REGEX and adding a debugFlag property, like this:
function onFinalize(result) {
normalizepost.FIELD_NORM_STRING(result, {
templateName: "CRIMINAL",
fieldName: "FaceFeature",
normalizationType: "REPLACE_REGEX",
normalizationRules: [{"regexp": /(?:great|big|large) (.*)/, "value": "$1: large"}, {"regexp": /(?:small|little|tiny) (.*)/, "value": "$1: small"}],
fallbackValue: "No normalization applied",
debugFlag: true
});
return result;
}
This property can be:
- true (boolean) or debug mode: reports all strings transformed by the regular expressions.
- verbose debug mode: reports all strings, both transformed and untransformed by the regular expressions.
FIELD_NORM_NUMERIC
The FIELD_NORM_NUMERIC method converts numbers written in words into numbers expressed in digits, possibly applying a multiplication factor to the numbers obtained.
Consider for example this template:
TEMPLATE(DISTANCE)
{
@KILOMETERS,
@METERS
}
If the following rule:
SCOPE SENTENCE
{
IDENTIFY(DISTANCE)
{
LEMMA("distance")
<1:4>
KEYWORD("a")
<1:4>
KEYWORD("b")
<1:3>
LEMMA("be")
<1:3>
@METERS[KEYWORD("five thousand")]
}
}
is applied to this input text
The distance from A to B is five thousand m.
with this code (see also the FIELD_CLONE method of mergepost):
function onFinalize(result) {
mergepost.FIELD_CLONE(result, {
templateName: "DISTANCE",
fieldName: "METERS",
clonedFieldName: "KILOMETERS"
})
return result
}
you get this record:
Template: DISTANCE
Field | Value |
---|---|
@METERS | five thousand |
@KILOMETERS | five thousand |
If you change the code like this:
function onFinalize(result) {
mergepost.FIELD_CLONE(result, {
templateName: "DISTANCE",
fieldName: "METERS",
clonedFieldName: "KILOMETERS"
})
normalizepost.FIELD_NORM_NUMERIC(result, {
lang: "EN",
templateName: "DISTANCE",
fieldName: "KILOMETERS",
adapt: "*m"
})
return result
}
and apply the rule above to the same input text, you get:
Template: DISTANCE
Field | Value |
---|---|
@METERS | five thousand |
@KILOMETERS | 5 |
The number in words is first converted to digits, then the multiplication factor *m (where "m" stands for "milli", corresponding to x 10^-3) is applied.
If the field value is already a number expressed with digits, no conversion takes place, but the number is recognized as such and the possible multiplication factor is applied.
So, in the case of the example above, if the initial value of field KILOMETERS had been 5000, it would have become 5 all the same.
The syntax is:
moduleVariable.FIELD_NORM_NUMERIC(result, arguments)
where:
- moduleVariable is the variable corresponding to the module and set with require().
- result is the object containing the analysis results.
- arguments is an object containing the parameters to be used. Such parameters are:
  - lang refers to the language in which numbers in words are written. Possible values are:
    - EN for English
    - IT for Italian
    - ES for Spanish
    - DE for German
    - NL for Dutch
  - templateName is the template name of the records to act upon. It can be an asterisk (*), which means "any template".
  - fieldName is the field name.
  - adapt is a multiplication factor. It can be an empty string, in which case the numeric value is not altered, or it can be one of the following:
Value | Multiplication factor |
---|---|
*p (pico) | x 10^-12 |
*n (nano) | x 10^-9 |
*u (micro) | x 10^-6 |
*m (milli) | x 10^-3 |
*K (kilo) | x 10^3 |
*M (mega) | x 10^6 |
*G (giga) | x 10^9 |
*T (tera) | x 10^12 |
The method also supports a parametric syntax, that is:
moduleVariable.FIELD_NORM_NUMERIC(result, lang, templateName, fieldName, adapt)
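The multiplication factors accepted by adapt amount to a lookup from prefix to power of ten, which the sketch below spells out. The names ADAPT_EXPONENTS and applyAdapt are hypothetical, the input is assumed to be a number already expressed in digits (the words-to-digits conversion is not covered), and throwing on an unknown adapt value is a choice made for this example only.

```javascript
// Illustrative only: the adapt values and their powers of ten.
var ADAPT_EXPONENTS = {
    "*p": -12, "*n": -9, "*u": -6, "*m": -3,
    "*K": 3, "*M": 6, "*G": 9, "*T": 12
};

function applyAdapt(number, adapt) {
    if (adapt === "") {
        return number; // empty string: the numeric value is not altered
    }
    if (!Object.prototype.hasOwnProperty.call(ADAPT_EXPONENTS, adapt)) {
        throw new Error("Unknown adapt value: " + adapt); // choice made for this sketch
    }
    var exponent = ADAPT_EXPONENTS[adapt];
    // Divide for negative exponents so that results such as 5000 / 1000 stay exact.
    return exponent >= 0
        ? number * Math.pow(10, exponent)
        : number / Math.pow(10, -exponent);
}

console.log(applyAdapt(5000, "*m")); // 5
console.log(applyAdapt(2, "*K"));    // 2000
```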
RENEXTRA
The RENEXTRA method renames records' templates and fields.
Consider for example this template:
TEMPLATE(PERSONAL_DATA)
{
@NAME,
@ADDRESS,
@JOB,
@AGE
}
If the following rule:
SCOPE SENTENCE
{
IDENTIFY(PERSONAL_DATA)
{
@NAME[TYPE(NPH)]
<>
@AGE[PATTERN("[1-9][0-9]")]
<>
@JOB[LEMMA("technical writer")]
}
}
is applied to this input text:
Christine is 42 years old and works as a technical writer.
you get this record:
Template: PERSONAL_DATA
Field | Value |
---|---|
@NAME | Christine |
@JOB | technical writer |
@AGE | 42 |
With this code:
function onFinalize(result) {
normalizepost.RENEXTRA(result, {
templateName: "PERSONAL_DATA",
newTemplateName: "PERSONAL_INFORMATION",
renameRules: null
})
return result;
}
if the rule above is applied to the same input text, you get this record:
Template: PERSONAL_INFORMATION
Field | Value |
---|---|
@NAME | Christine |
@JOB | technical writer |
@AGE | 42 |
The template name of the PERSONAL_DATA record has changed to PERSONAL_INFORMATION.
If the code was:
function onFinalize(result) {
normalizepost.RENEXTRA(result, {
templateName: "PERSONAL_DATA",
newTemplateName: null,
renameRules: [{name: "NAME", new: "PROPER_NAME"}, {name: "AGE", new: "YEARS_OF_AGE"}]
})
return result;
}
the output would be:
Template: PERSONAL_DATA
Field | Value |
---|---|
@YEARS_OF_AGE | AGE |
@PROPER_NAME | NAME |
@JOB | technical writer |
In this case, fields have been renamed and the old names have become the values of the fields themselves.
Note
Odd as it may seem, this behavior is by design. This method replicates the behavior of a post-processor that was available in the legacy technology from which Studio evolved, and it is meant to be used for backward compatibility when importing old projects. For an alternative way to rename records' templates and fields, consider the JsonPlug module.
RENEXTRA is typically used in combination with REPLACEFIELD (see below).
The syntax is:
moduleVariable.RENEXTRA(result, arguments)
where:
- moduleVariable is the variable corresponding to the module and set with require().
- result is the object containing the analysis results.
- arguments is an object containing the parameters to be used. Such parameters are:
  - templateName is the template name of the records to act upon. It can be an asterisk (*), which means "any template".
  - newTemplateName is the new template name that will replace the old one. If null, the template name is not changed, so use null if you only want to rename fields.
  - renameRules is an array containing objects each representing a field rename rule and having the following properties:
    - name: old field name.
    - new: new field name.
    If null, fields are not renamed, so use null if you only want to change the template name.
The method also supports a purely parametric syntax, that is:
moduleVariable.RENEXTRA(result, templateName, newTemplateName, renameRules)
REPLACEFIELD
The REPLACEFIELD method changes all the matches of a regular expression inside the values of all the fields that have been renamed using the RENEXTRA method.
Consider this template:
TEMPLATE(PERSONAL_DATA)
{
@Name,
@Job,
@Product,
@Company,
@Role
}
If the following rule:
SCOPE SENTENCE
{
IDENTIFY(PERSONAL_DATA)
{
@Name[TYPE(NPH)]
<1:5>
@Job[LEMMA("software engineer")]
<1:4>
@Company[TYPE(COM)]
}
}
is applied to this input text:
George Dickinson is a software engineer for Acme Ltd.
you will normally get this record:
Template: PERSONAL_DATA
Field | Value |
---|---|
@Name | George Dickinson |
@Job | software engineer |
@Company | Acme Ltd. |
With this code:
function onFinalize(result) {
var renameRules = [{name: "Name", new: "PseudoName"}]
normalizepost.RENEXTRA(result, {
templateName: "PERSONAL_DATA",
newTemplateName: null,
renameRules: renameRules
})
normalizepost.REPLACEFIELD(result, {
regularExpression: "Name",
replaceValue: "John Doe"
})
return result;
}
the record becomes:
Template: PERSONAL_DATA
Field | Value |
---|---|
@PseudoName | John Doe |
@Job | software engineer |
@Company | Acme Ltd. |
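The two-step effect of RENEXTRA followed by REPLACEFIELD can be traced on a simplified record. Everything structural in this sketch is hypothetical: the record layout ({ template, fields: [{ name, value }] }), the renamed marker, and the function names renameFields and replaceInRenamedFields are assumptions made for illustration and do not describe the real result object or the module's code.

```javascript
// Illustrative only. The record layout and the "renamed" marker are hypothetical.
function renameFields(record, renameRules) {
    record.fields.forEach(function (field) {
        for (var i = 0; i < renameRules.length; i++) {
            if (field.name === renameRules[i].name) {
                field.value = renameRules[i].name; // by design, the old name becomes the value
                field.name = renameRules[i].new;
                field.renamed = true;
                break; // apply at most one rule per field
            }
        }
    });
    return record;
}

function replaceInRenamedFields(record, regularExpression, replaceValue) {
    var regexp = new RegExp(regularExpression, "g"); // "g": change all the matches
    record.fields.forEach(function (field) {
        if (field.renamed) {
            field.value = field.value.replace(regexp, replaceValue);
        }
    });
    return record;
}

var record = {
    template: "PERSONAL_DATA",
    fields: [
        { name: "Name", value: "George Dickinson" },
        { name: "Job", value: "software engineer" }
    ]
};
renameFields(record, [{ name: "Name", new: "PseudoName" }]);
// record.fields[0] is now { name: "PseudoName", value: "Name", renamed: true }
replaceInRenamedFields(record, "Name", "John Doe");
console.log(record.fields[0].name + " = " + record.fields[0].value); // PseudoName = John Doe
```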
The syntax is:
moduleVariable.REPLACEFIELD(result, arguments)
where:
- moduleVariable is the variable corresponding to the module and set with require().
- result is the object containing the analysis results.
- arguments is an object containing the parameters to be used. Such parameters are:
  - regularExpression is a string containing a regular expression used to find the parts of fields' values to change.
  - replaceValue is the string that replaces all the matches of regularExpression in fields' values.
The method also supports a purely parametric syntax, that is:
moduleVariable.REPLACEFIELD(result, regularExpression, replaceValue)
SPLIT_NORM_LIST_REPLACE
Use the SPLIT_NORM_LIST_REPLACE method to set an array of replacement objects starting from a list file. This array can be used as the value of the normalizationRules parameter of the FIELD_NORM_STRING method when the value of its normalizationType parameter is REPLACE.
For example, consider this template:
TEMPLATE(PERSONAL_DATA)
{
@Name,
@Address,
@Phone,
@Job
}
If this rule:
SCOPE SENTENCE
{
IDENTIFY(PERSONAL_DATA)
{
@Name[TYPE(NPH)]
<>
@Job[LEMMA("software engineer")]
}
}
is applied to this text:
John is a software engineer.
you will get:
Template: PERSONAL_DATA
Field | Value |
---|---|
@Name | John |
@Job | software engineer |
With this code:
function onFinalize(result) {
var replacementList = normalizepost.SPLIT_NORM_LIST_REPLACE("replacements.cl")
normalizepost.FIELD_NORM_STRING(result, "PERSONAL_DATA", "Job", "REPLACE", replacementList);
return result;
}
and if the replacements.cl list file has these contents:
programmer=software engineer|developer|dev
you will get:
Template: PERSONAL_DATA
Field | Value |
---|---|
@Name | John |
@Job | programmer |
The syntax of SPLIT_NORM_LIST_REPLACE is:
moduleVariable.SPLIT_NORM_LIST_REPLACE(listFileName)
where:
- moduleVariable is the variable corresponding to the module and set with require().
- listFileName is the list file name. List files should be placed in a folder called normalizepost, which in turn must be located either in the rules or in the modules folder of the project.
Note
- If the same list file exists in both locations, the normalizepost folder under the rules folder takes precedence.
- If the list is not found within the normalizepost folder, the method will attempt to load the entire path as if it starts from the rules folder.
Each line of the list file must have this syntax:
newValue=valueToReplace_1[|valueToReplace_2...|valueToReplace_n]
where:
- newValue is the new value.
- valueToReplace_n is a value to replace.
If newValue is empty, the field is deleted. If, as a consequence of replacement with empty strings, all of a record's fields are deleted, the entire record is deleted too.
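The line syntax above can be read as "one new value, then every value it replaces". A minimal parser sketch follows; the function name parseReplaceList is hypothetical, and the choices to skip blank or malformed lines, trim whitespace, lower-case the values to replace, and return a plain { valueToReplace: newValue } object are assumptions, not a description of what the module builds internally.

```javascript
// Illustrative only: turns list-file text into a { valueToReplace: newValue } object.
// Assumptions: one rule per line, blank or malformed lines are skipped, values are
// trimmed and lower-cased so that they can be matched case insensitively.
function parseReplaceList(text) {
    var rules = {};
    text.split(/\r?\n/).forEach(function (line) {
        var separator = line.indexOf("=");
        if (line.trim() === "" || separator === -1) {
            return; // skip blank or malformed lines
        }
        var newValue = line.slice(0, separator).trim();
        line.slice(separator + 1).split("|").forEach(function (valueToReplace) {
            rules[valueToReplace.trim().toLowerCase()] = newValue;
        });
    });
    return rules;
}

var rules = parseReplaceList("programmer=software engineer|developer|dev");
console.log(rules["developer"]); // programmer
```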
SPLIT_NORM_LIST_REPLACE_REGEX
Like SPLIT_NORM_LIST_REPLACE, with the difference that list files are written using JavaScript regular expressions. The resulting array can be used as the value of the normalizationRules parameter of the FIELD_NORM_STRING method when the value of its normalizationType parameter is REPLACE_REGEX.
For example, consider this template:
TEMPLATE(PERSONAL_DATA)
{
@Name,
@Address,
@Phone,
@Job
}
If this rule:
SCOPE SENTENCE
{
IDENTIFY(PERSONAL_DATA)
{
@Name[TYPE(NPH)]
<>
@Job[LEMMA("software developer")]
}
}
is applied to this text:
John is a software developer.
you will get this record:
Template: PERSONAL_DATA
Field | Value |
---|---|
@Name | John |
@Job | software developer |
With this code:
function onFinalize(result) {
var replacementList = normalizepost.SPLIT_NORM_LIST_REPLACE_REGEX("replacements.cl")
normalizepost.FIELD_NORM_STRING(result, "PERSONAL_DATA", "Job", "REPLACE_REGEX", replacementList);
return result;
}
and if the replacements.cl list file has these contents:
$2 programmer=\b((software) developer)\b/i
you will get this record:
Template: PERSONAL_DATA
Field | Value |
---|---|
@Name | John |
@Job | software programmer |
The syntax of SPLIT_NORM_LIST_REPLACE_REGEX is:
moduleVariable.SPLIT_NORM_LIST_REPLACE_REGEX(listFileName)
where:
- moduleVariable is the variable corresponding to the module and set with require().
- listFileName is the list file name. List files should be placed in a folder called normalizepost, which in turn must be located either in the rules or in the modules folder of the project.
Note
- If the same list file exists in both locations, the normalizepost folder under the rules folder takes precedence.
- If the list is not found within the normalizepost folder, the method will attempt to load the entire path as if it starts from the rules folder.
Each line of the list file must have this syntax:
replacementString=regularExpression
where:
- replacementString is the replacement string, which can contain references to capturing groups like $1, $2, etc.
- regularExpression is the JavaScript regular expression.
If replacementString is empty, the field is deleted. If, as a consequence of replacement with empty strings, all of a record's fields are deleted, the entire record is deleted too.
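To show how one list line maps onto a { regexp, value } rule, here is a minimal parser sketch. The function name parseRegexListLine is hypothetical. Two points are assumptions inferred only from the example line shown earlier in this section: the first "=" separates the replacement string from the regular expression, and an optional trailing "/flags" suffix carries the regex flags.

```javascript
// Illustrative only: turns one list-file line into a { regexp, value } rule.
// Assumptions: the first "=" separates the two parts, and an optional
// trailing "/flags" suffix carries the regex flags.
function parseRegexListLine(line) {
    var separator = line.indexOf("=");
    if (separator === -1) {
        throw new Error("Missing '=' in list line: " + line);
    }
    var replacementString = line.slice(0, separator);
    var pattern = line.slice(separator + 1);
    var flags = "";
    var withFlags = /^(.*)\/([gimsuy]*)$/.exec(pattern);
    if (withFlags) {
        pattern = withFlags[1];
        flags = withFlags[2];
    }
    return { regexp: new RegExp(pattern, flags), value: replacementString };
}

// "\\b" in a JavaScript string literal is the two characters \b of the list file.
var rule = parseRegexListLine("$2 programmer=\\b((software) developer)\\b/i");
console.log("software developer".replace(rule.regexp, rule.value)); // software programmer
```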
load
The load method prepares one or more of the operations that can be performed with the methods above, using as its source a configuration file generated when importing a project created with a legacy edition of Studio. Prepared operations are then applied using the apply method.
Warning
The load method is not required in cases other than the one indicated above, and the import procedure already generates the appropriate statements inside the main.jr file, so there are basically no cases in which you have to write code that uses this method.
For example, when importing an old project, Studio may generate this code:
function initialize(cmdline) {
if (!normalizepost.load('Config.xml')) {
CONSOLE.error(normalizepost.getLastError());
return false;
}
return true;
}
function onFinalize(result) {
result = normalizepost.apply(result);
return result;
}
Its syntax is:
moduleVariable.load(configPath)
where:
- moduleVariable is the variable corresponding to the module and set with require().
- configPath is the config.xml file path generated by the import procedure.
The method returns true in case of success, false otherwise. In case of failure it sets an error message you can retrieve with the getLastError method.
apply
The apply method performs all the operations prepared with the invocation of the load method.
For example:
function onFinalize(result) {
result = normalizepost.apply(result);
return result;
}
The syntax is:
moduleVariable.apply(result)
where:
- moduleVariable is the variable corresponding to the module and set with require().
- result is the object containing the analysis results.
getLastError
The getLastError method retrieves the message corresponding to the last error that occurred when the load method fails. Use it to display the error message.
For example:
function initialize(cmdline) {
if (!normalizepost.load('Config.xml')) {
CONSOLE.error(normalizepost.getLastError());
return false;
}
return true;
}
The syntax is:
moduleVariable.getLastError()
where moduleVariable is the variable corresponding to the module and set with require().
close
The close method is used to free up the resources allocated by the normalizepost module object.
It's not mandatory to invoke this method, but if you decide to do it, invoke it inside the shutdown function.
For example:
function shutdown() {
normalizepost.close();
}
The syntax is:
moduleVariable.close()
where moduleVariable is the variable corresponding to the module and set with require().