mergepost

Overview

The mergepost module lets you post-process the extraction output: you can clone records and fields, merge records and replace field values.

It has the following methods:

FIELD_CLONE
MERGE_BY_VALUE
MERGE_BY_INSTANCE
MERGE_BY_INSTANCE_BEGIN_ONLY
MERGE_BY_OVERLAP
REPLACE_FIELD_VALUE
RECORD_CLONE
load
apply
getLastError

When you install the mergepost module in your project in Studio, Studio modifies the main.jr file by inserting this statement at the beginning of the file:

var mergepost = require("modules/mergepost");

The statement above sets a variable with an instance of the module so that you can use it inside event handling functions.

All methods except for load and getLastError must be used in the onFinalize function, because they act on the analysis results available when this function is run.

The load method must be used in the initialize function, because it is the right place for the initialization of objects needed in other event handling functions.

The getLastError method must be used in the initialize function, because it retrieves the message corresponding to the last error that occurred when the load method fails.

RECORD_CLONE

The RECORD_CLONE method allows you to clone a record with its fields.

For example, consider these two templates:

TEMPLATE(ATHLETES)
{
    @FULL_NAME,
    @AGE,
    @DATE_OF_BIRTH,
    @PLACE_OF_BIRTH
}

TEMPLATE(OLYMPIC_CHAMPIONS)
{
    @FULL_NAME,
    @AGE,
    @DATE_OF_BIRTH,
    @PLACE_OF_BIRTH
}

If this rule

SCOPE SENTENCE 
{
    IDENTIFY(ATHLETES)
    {
        @FULL_NAME[TYPE(NPH)]
        <>
        @AGE[PATTERN("[1-9][0-9]")]
        <>
        @DATE_OF_BIRTH[TYPE(DAT)]
        <>
        @PLACE_OF_BIRTH[SYNCON(100005092)]//@SYN: #100005092# [Baltimore]
    }
}

is applied to this text:

Michael Phelps (38) was born on the 30th of June 1985 in Baltimore.

you will get this record:

Template: ATHLETES

Field	Value
@FULL_NAME	Michael Phelps
@AGE	38
@DATE_OF_BIRTH	Jun-30-1985
@PLACE_OF_BIRTH	Baltimore

With this code:

function onFinalize(result) {
    mergepost.RECORD_CLONE(result, {
        templateName: "ATHLETES",
        newTemplateName: "OLYMPIC_CHAMPIONS"
    })
    return result
}

you will get these records:

Template: ATHLETES

Field	Value
@FULL_NAME	Michael Phelps
@AGE	38
@DATE_OF_BIRTH	Jun-30-1985
@PLACE_OF_BIRTH	Baltimore

Template: OLYMPIC_CHAMPIONS

Field	Value
@FULL_NAME	Michael Phelps
@AGE	38
@DATE_OF_BIRTH	Jun-30-1985
@PLACE_OF_BIRTH	Baltimore

The syntax for RECORD_CLONE is:

moduleVariable.RECORD_CLONE(result, arguments)

where:

moduleVariable is the variable corresponding to the module and set with require().
result is the object containing the analysis results.
arguments is an object containing the parameters to be used. Such parameters are:
- templateName is the name of the template whose records are cloned.
- newTemplateName is the name of the template assigned to the cloned records.

The method also supports a parametric syntax, that is:

moduleVariable.RECORD_CLONE(result, templateName, newTemplateName)
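The effect of RECORD_CLONE can be illustrated with a minimal sketch in plain JavaScript. It does not use the mergepost module: the records array is a hypothetical, simplified stand-in for the analysis results, and cloneRecords is an illustrative helper, not part of the module.

```javascript
// Hypothetical, simplified stand-in for extraction records; the real
// `result` object passed to onFinalize has its own structure.
const records = [
    {
        template: "ATHLETES",
        fields: [
            { name: "FULL_NAME", value: "Michael Phelps" },
            { name: "AGE", value: "38" }
        ]
    }
];

// Illustrative helper (not a mergepost API): returns the original records
// plus one copy of each record of `templateName`, assigned to `newTemplateName`.
function cloneRecords(records, templateName, newTemplateName) {
    const clones = records
        .filter((record) => record.template === templateName)
        .map((record) => ({
            template: newTemplateName,
            // Copy each field so the clone shares no objects with its source.
            fields: record.fields.map((field) => ({ ...field }))
        }));
    return records.concat(clones);
}

const cloned = cloneRecords(records, "ATHLETES", "OLYMPIC_CHAMPIONS");
console.log(cloned.map((record) => record.template));
```

The cloned record carries the same field values as the original one, under the new template name.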

FIELD_CLONE

The FIELD_CLONE method creates a new field whose value is a copy of the value of an existing field.
It must be used in the onFinalize function when extraction results are available.

Consider the following template:

TEMPLATE(PERSONAL_DATA)
{
    @NAME,
    @NAME_INITIALS,
    @ADDRESS,
    @PHONE
}

When the following rule:

SCOPE SENTENCE
{
    IDENTIFY(PERSONAL_DATA)
    {
        @NAME[TYPE(NPH)]
    }
}

is applied to this text:

Sherlock Holmes is a private investigator.

you get this record:

Template: PERSONAL_DATA

Field	Value
@NAME	Sherlock Holmes

With this code:

function onFinalize(result) {
    mergepost.FIELD_CLONE(result, {
        templateName: "PERSONAL_DATA",
        fieldName: "NAME",
        clonedFieldName: "NAME_INITIALS"
    })
    return result
}

the output becomes:

Template: PERSONAL_DATA

Field	Value
@NAME_INITIALS	Sherlock Holmes
@NAME	Sherlock Holmes

In a common use case, the value of the new field is then processed. You can see an example of this in the description of the REPLACE_FIELD_VALUE method below.

The syntax of the method is:

moduleVariable.FIELD_CLONE(result, arguments)

where:

moduleVariable is the variable corresponding to the module and set with require().
result is the object containing the analysis results.
arguments is an object containing the parameters to be used. Such parameters are:
- templateName is the template name of the records in which to create the new field. It can be an asterisk (*), which means "any template".
- fieldName is the name of the field to clone.
- clonedFieldName is the name of the clone field.

The method also supports a parametric syntax, that is:

moduleVariable.FIELD_CLONE(result, templateName, fieldName, clonedFieldName)
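As a rough illustration of the parameters above, including the asterisk wildcard, here is a sketch in plain JavaScript. The record layout and the cloneField helper are hypothetical and only mimic the documented behavior on simplified data.

```javascript
// Hypothetical, simplified records; not the real analysis results object.
const records = [
    { template: "PERSONAL_DATA", fields: [{ name: "NAME", value: "Sherlock Holmes" }] },
    { template: "OTHER_DATA", fields: [{ name: "NAME", value: "John Watson" }] }
];

// Illustrative helper (not a mergepost API): for every record of the given
// template, adds a copy of each `fieldName` field under `clonedFieldName`.
// An asterisk as template name is taken to mean "any template".
function cloneField(records, templateName, fieldName, clonedFieldName) {
    for (const record of records) {
        if (templateName !== "*" && record.template !== templateName) continue;
        // Collect the copies first so that the loop never sees its own output.
        const copies = record.fields
            .filter((field) => field.name === fieldName)
            .map((field) => ({ ...field, name: clonedFieldName }));
        record.fields.push(...copies);
    }
    return records;
}

cloneField(records, "PERSONAL_DATA", "NAME", "NAME_INITIALS");
console.log(records[0].fields.map((field) => field.name + "=" + field.value));
```

Calling the helper with "*" instead of "PERSONAL_DATA" would also add the clone field to the record of the second template.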

MERGE_BY_VALUE

The MERGE_BY_VALUE method merges all the records of a template when they have the same value for one or more given fields.
It must be used in the onFinalize function when extraction results are available.

For example, consider the following template:

TEMPLATE(PERSONAL_DATA)
{
    @Name,
    @Job,
    @Product,
    @Company,
    @Role
}

If these rules:

SCOPE SENTENCE
{
    IDENTIFY(PERSONAL_DATA)
    {
        @Name[TYPE(NPH)] 
        <1:5>
        @Role[LEMMA("purchaser")]
        <1:4>
        @Product[KEYWORD("iPhone")]
    }

    IDENTIFY(PERSONAL_DATA)
    {

        @Job[LEMMA("software engineer")]
        <1:5>
        @Company[TYPE(COM)]
        <1:4>
        @Name[TYPE(NPH)]
    }
}

are applied to this input text:

John Markovitch is the first purchaser of the new iPhone. The best software engineer for Samsung is John Markovitch.

you will get these records:

Template: PERSONAL_DATA

Field	Value
@Role	purchaser
@Product	iPhone
@Name	John Markovitch

Template: PERSONAL_DATA

Field	Value
@Name	John Markovitch
@Job	software engineer
@Company	Samsung

As you can see, there are two records about the same person (John Markovitch), each with different information.
If the rules above are applied to the same input text and you have this code:

function onFinalize(result) {
    mergepost.MERGE_BY_VALUE(result, {
        templateName: "PERSONAL_DATA",
        aggregatorFields: ["Name"]
    })
    return result;
}

you will get the following record:

Template: PERSONAL_DATA

Field	Value
@Role	purchaser
@Product	iPhone
@Name	John Markovitch
@Job	software engineer
@Company	Samsung

The MERGE_BY_VALUE method gathered all the information around the value of the "attractor field" (the field listed in aggregatorFields), in this case John Markovitch for the Name field, creating a single record.

Both occurrences of the @Name field have been merged. In case other non-attractor fields occur more than once with the same extracted value, their occurrences will also be merged.

Note

When merging instances with different confidence scores, the field confidence score is calculated with the same formula used for the confidence of instances. The same happens with MERGE_BY_INSTANCE and MERGE_BY_INSTANCE_BEGIN_ONLY.

The syntax of the MERGE_BY_VALUE method is:

moduleVariable.MERGE_BY_VALUE(result, arguments)

where:

moduleVariable is the variable corresponding to the module and set with require().
result is the object containing the analysis results.
arguments is an object containing the parameters to be used. Such parameters are:
- templateName is the template name of the records to merge.
- aggregatorFields is an array containing the field names around which the information is gathered. In case of a single field, you can also use a string instead of an array.
- inhibitorFields is an optional array containing the names of fields that, if contained in a record, prevent it from being merged. In case of a single field, you can also use a string instead of an array. If undeclared, no inhibitors will be checked.
- caseInsensitiveFlag is an optional boolean that, if set to true, triggers the merge operation in a case-insensitive manner. In this scenario, the winner's case is determined by the initial field value around which the information is consolidated (refer to the example below for clarity).

For example, given the following template:

TEMPLATE(PERSONAL_DATA)
{
    @Name,
    @Job,
    @Product,
    @Company,
    @Role
}

and the following extraction rules with the TEXT transformer applied to the Name field:

SCOPE SENTENCE
{
    IDENTIFY(PERSONAL_DATA)
    {
        @Name[TYPE(NPH)]|[TEXT] 
        <1:5>
        @Role[LEMMA("purchaser")]
        <1:4>
        @Product[KEYWORD("iPhone")]
    }

    IDENTIFY(PERSONAL_DATA)
    {

        @Job[LEMMA("software engineer")]
        <1:5>
        @Company[TYPE(COM)]
        <1:4>
        @Name[TYPE(NPH)]|[TEXT]
    }
}

applied to this text:

JOHN MARKOVITCH is the first purchaser of the new iPhone. The best software engineer for Samsung is John Markovitch.

with this code:

function onFinalize(result) {
    mergepost.MERGE_BY_VALUE(result, {
        templateName: "PERSONAL_DATA",
        aggregatorFields: ["Name"],
        caseInsensitiveFlag: true
    })
    return result;
}

you will get this record:

Template: PERSONAL_DATA

Field	Value
@Role	purchaser
@Product	iPhone
@Name	JOHN MARKOVITCH
@Job	software engineer
@Company	Samsung

Because caseInsensitiveFlag is set to true, the two spellings are treated as the same value and the merged record keeps the case of the first occurrence of the Name field value, JOHN MARKOVITCH, written in uppercase.

The method also supports a parametric syntax, that is:

moduleVariable.MERGE_BY_VALUE(result, templateName, aggregatorFields[, inhibitorFields])

or:

moduleVariable.MERGE_BY_VALUE(result, templateName, aggregatorFields, inhibitorFields[, caseInsensitiveFlag]);

Warning

When using the purely parametric call, even if you don't need the inhibitorFields parameter, you must declare it empty if you need to use the caseInsensitiveFlag parameter.
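The merge logic described in this section can be approximated with the following sketch in plain JavaScript. It is an illustration under stated assumptions, not the module's implementation: the record layout and the mergeByValue helper are hypothetical, and records that lack an aggregator field are assumed to be left unmerged.

```javascript
// Hypothetical, simplified records; not the real analysis results object.
const records = [
    { template: "PERSONAL_DATA", fields: [
        { name: "Name", value: "JOHN MARKOVITCH" },
        { name: "Role", value: "purchaser" },
        { name: "Product", value: "iPhone" }
    ] },
    { template: "PERSONAL_DATA", fields: [
        { name: "Job", value: "software engineer" },
        { name: "Company", value: "Samsung" },
        { name: "Name", value: "John Markovitch" }
    ] }
];

// Illustrative helper (not a mergepost API) mimicking the positional form.
function mergeByValue(records, templateName, aggregatorFields, inhibitorFields = [], caseInsensitive = false) {
    // A single field name may be given as a string instead of an array.
    const aggregators = [].concat(aggregatorFields);
    const inhibitors = [].concat(inhibitorFields);
    const merged = new Map();
    const output = [];
    for (const record of records) {
        const names = record.fields.map((field) => field.name);
        const key = aggregators.map((name) => {
            const field = record.fields.find((f) => f.name === name);
            if (!field) return null;
            return caseInsensitive ? field.value.toLowerCase() : field.value;
        });
        // Assumption: records of other templates, records missing an
        // aggregator field and records with an inhibitor field are kept as is.
        const skip = record.template !== templateName
            || key.includes(null)
            || inhibitors.some((name) => names.includes(name));
        if (skip) { output.push(record); continue; }
        const keyString = JSON.stringify(key);
        const target = merged.get(keyString);
        if (!target) {
            // The first record with this key decides the case of the value.
            const copy = { template: record.template, fields: record.fields.map((f) => ({ ...f })) };
            merged.set(keyString, copy);
            output.push(copy);
            continue;
        }
        for (const field of record.fields) {
            const isAggregator = aggregators.includes(field.name);
            const duplicate = target.fields.some((f) => f.name === field.name
                && (isAggregator || f.value === field.value));
            if (!duplicate) target.fields.push({ ...field });
        }
    }
    return output;
}

// Positional call: the inhibitor list is passed empty to reach the case flag.
const mergedRecords = mergeByValue(records, "PERSONAL_DATA", ["Name"], [], true);
console.log(mergedRecords.length, mergedRecords[0].fields.map((f) => f.value));
```

The last call mirrors the warning above: the fourth argument is an empty inhibitor list, passed only so that the fifth argument can be set.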

MERGE_BY_INSTANCE

The MERGE_BY_INSTANCE method merges all the records of a template according to the instance of a value for one or more given fields. Unlike MERGE_BY_VALUE, this allows you to distinguish between identical textual values that do not correspond to the same entity.

The method must be used in the onFinalize function when extraction results are available.

Note

The (optional) use of the SOLITARY attribute is recommended when using this method.

For example, consider this template:

TEMPLATE(PERSONAL_DATA)
{
    @Name (S),
    @Address,
    @Phone
}

With these rules:

SCOPE SENTENCE
{
    IDENTIFY(PERSONAL_DATA)
    {
        @Name[TYPE(NPH) - TYPE(PRO)]
        <-9:9>
        @Address[TYPE(ADR)]
    }

    IDENTIFY(PERSONAL_DATA)
    {
        @Name[TYPE(NPH) - TYPE(PRO)]
        <:9>
        @Phone[TYPE(PHO)]
    }
}

applied to this text:

Steven lives in Baltimora Street and his phone number is 3333333333. His friend Mary bought a house in Tropicana Street where she lives with her husband Steven and his phone number is 3333333331.

you will get these records:

Template: PERSONAL_DATA

Field	Value
@Name	Steven
@Address	Baltimora Street

Template: PERSONAL_DATA

Field	Value
@Name	Mary
@Address	Tropicana Street

Template: PERSONAL_DATA

Field	Value
@Name	Steven
@Address	Tropicana Street

Template: PERSONAL_DATA

Field	Value
@Phone	3333333333
@Name	Steven

Template: PERSONAL_DATA

Field	Value
@Phone	3333333331
@Name	Steven

With this code:

function onFinalize(result) {
    mergepost.MERGE_BY_INSTANCE(result, {
        templateName: "PERSONAL_DATA",
        aggregatorFields: ["Name"]
    })
    return result;
}

and with this other one:

function onFinalize(result) {
    mergepost.MERGE_BY_VALUE(result, {
        templateName: "PERSONAL_DATA",
        aggregatorFields: ["Name"]
    })
    return result;
}

you will get these records with the first snippet (MERGE_BY_INSTANCE):

Template: PERSONAL_DATA

Field	Value
@Name	Mary
@Address	Tropicana Street

Template: PERSONAL_DATA

Field	Value
@Phone	3333333333
@Name	Steven
@Address	Baltimora Street

Template: PERSONAL_DATA

Field	Value
@Phone	3333333331
@Name	Steven
@Address	Tropicana Street

and these records with the second snippet (MERGE_BY_VALUE):

Template: PERSONAL_DATA

Field	Value
@Name	Mary
@Address	Tropicana Street

Template: PERSONAL_DATA

Field	Value
@Phone	3333333331
@Phone	3333333333
@Name	Steven
@Address	Tropicana Street
@Address	Baltimora Street

As you can see, MERGE_BY_VALUE is based on identical textual values but it does not recognize different entities, while MERGE_BY_INSTANCE allows you to differentiate among different entities having a common textual value.

If fields other than the one on which the merge is applied occur more than once with the same extracted value, their occurrences are merged as well.
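The difference from a value-based merge can be sketched in plain JavaScript. This is only an illustration: it assumes that an instance is identified by the position of the value in the text, and the field layout (begin, end) and the mergeByInstance helper are hypothetical names, not part of the module.

```javascript
const text = "Steven lives in Baltimora Street and his phone number is 3333333333. " +
    "His friend Mary bought a house in Tropicana Street where she lives with " +
    "her husband Steven and his phone number is 3333333331.";

// Builds a hypothetical field whose position is found in the text from `from`.
function field(name, value, from = 0) {
    const begin = text.indexOf(value, from);
    return { name, value, begin, end: begin + value.length };
}

// Start of the second occurrence of "Steven".
const secondSteven = text.indexOf("Steven", 1);

const records = [
    { template: "PERSONAL_DATA", fields: [field("Name", "Steven"), field("Address", "Baltimora Street")] },
    { template: "PERSONAL_DATA", fields: [field("Name", "Mary"), field("Address", "Tropicana Street")] },
    { template: "PERSONAL_DATA", fields: [field("Name", "Steven", secondSteven), field("Address", "Tropicana Street")] },
    { template: "PERSONAL_DATA", fields: [field("Phone", "3333333333"), field("Name", "Steven")] },
    { template: "PERSONAL_DATA", fields: [field("Phone", "3333333331"), field("Name", "Steven", secondSteven)] }
];

// Illustrative helper (not a mergepost API): groups records by the position
// of the aggregator field instead of by its textual value.
function mergeByInstance(records, templateName, aggregatorField) {
    const byInstance = new Map();
    const output = [];
    for (const record of records) {
        const pivot = record.fields.find((f) => f.name === aggregatorField);
        if (record.template !== templateName || !pivot) { output.push(record); continue; }
        const key = pivot.begin + ":" + pivot.end;
        let target = byInstance.get(key);
        if (!target) {
            target = { template: record.template, fields: [] };
            byInstance.set(key, target);
            output.push(target);
        }
        for (const f of record.fields) {
            // Fields with the same name and value are kept only once.
            const known = target.fields.some((t) => t.name === f.name && t.value === f.value);
            if (!known) target.fields.push({ ...f });
        }
    }
    return output;
}

const merged = mergeByInstance(records, "PERSONAL_DATA", "Name");
for (const record of merged) {
    console.log(record.fields.map((f) => f.name + "=" + f.value).join(", "));
}
```

In this sketch the two occurrences of Steven have different positions, so they produce two separate records, each with its own address and phone number.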

The syntax is:

moduleVariable.MERGE_BY_INSTANCE(result, arguments)

where:

moduleVariable is the variable corresponding to the module and set with require().
result is the object containing the analysis results.
arguments is an object containing the parameters to be used. Such parameters are:
- templateName is the template name of the records to merge.
- aggregatorFields is an array containing the field names around which the information is gathered. In case of a single field, you can also use a string instead of an array.
- inhibitorFields is an optional array containing the names of fields that, if contained in a record, prevent it from being merged. In case of a single field, you can also use a string instead of an array. If undeclared, no inhibitors will be checked.

The method also supports a parametric syntax, that is:

moduleVariable.MERGE_BY_INSTANCE(result, templateName, aggregatorFields[, inhibitorFields])

MERGE_BY_INSTANCE_BEGIN_ONLY

Like MERGE_BY_INSTANCE, with the difference that the MERGE_BY_INSTANCE_BEGIN_ONLY method only considers the beginning of the value instance.

Note

The (optional) use of the SOLITARY attribute is recommended when using this method.

For example, consider this template:

TEMPLATE(PERSONAL_DATA)
{
    @Name (S),
    @Address,
    @Phone
}

With these rules:

SCOPE SENTENCE
{
    IDENTIFY(PERSONAL_DATA)
    {
        @Name[TYPE(NPH) - TYPE(PRO)]
        <-9:9>
        @Address[TYPE(ADR)]
    }

    IDENTIFY(PERSONAL_DATA)
    {
        @Name[TYPE(NPH) - TYPE(PRO)]
        <:9>
        @Phone[TYPE(PHO)]
    }
}

applied to this text:

Steven lives in Baltimora Street and his phone number is 3333333333. His friend Mary bought a house in Tropicana Street where she lives with her husband Steven and his phone number is 3333333331.

you will get these records:

Template: PERSONAL_DATA

Field	Value
@Name	Steven
@Address	Baltimora Street

Template: PERSONAL_DATA

Field	Value
@Name	Mary
@Address	Tropicana Street

Template: PERSONAL_DATA

Field	Value
@Name	Steven
@Address	Tropicana Street

Template: PERSONAL_DATA

Field	Value
@Phone	3333333333
@Name	Steven

Template: PERSONAL_DATA

Field	Value
@Phone	3333333331
@Name	Steven

With this code:

function onFinalize(result) {
    mergepost.MERGE_BY_INSTANCE_BEGIN_ONLY(result, {
        templateName: "PERSONAL_DATA",
        aggregatorFields: ["Name"]
    })
    return result;
}

you will get these records:

Template: PERSONAL_DATA

Field	Value
@Name	Mary
@Address	Tropicana Street

Template: PERSONAL_DATA

Field	Value
@Phone	3333333333
@Name	Steven
@Address	Baltimora Street

Template: PERSONAL_DATA

Field	Value
@Phone	3333333331
@Name	Steven
@Address	Tropicana Street

The syntax is:

moduleVariable.MERGE_BY_INSTANCE_BEGIN_ONLY(result, arguments)

where:

moduleVariable is the variable corresponding to the module and set with require().
result is the object containing the analysis results.
arguments is an object containing the parameters to be used. Such parameters are:
- templateName is the template name of the records to merge.
- aggregatorFields is an array containing the field names around which the information is gathered. In case of a single field, you can also use a string instead of an array.
- inhibitorFields is an optional array containing the names of fields that, if contained in a record, prevent it from being merged. In case of a single field, you can also use a string instead of an array. If undeclared, no inhibitors will be checked.

The method also supports a purely parametric syntax, that is:

moduleVariable.MERGE_BY_INSTANCE_BEGIN_ONLY(result, templateName, aggregatorFields[, inhibitorFields])

MERGE_BY_OVERLAP

The MERGE_BY_OVERLAP method merges all the records of a template by comparing the offsets of the pivot fields and searching for a common position with an overlap. Unlike the MERGE_BY_INSTANCE method, this one allows partial extractions to be merged, as long as they have at least one position in common.

During the merge process, only the positions of the extracted values are used, not their actual values. If an overlap occurs, that combination of offsets is used as the pivot key, and the sibling fields are aggregated on the pivot key. However, it's also possible to declare inhibitory fields that skip the record during the merging process.

After the merge is complete, the user can choose to normalize the extracted value on the longest extraction or by generating a new value using the minimum and maximum offset of the newly merged field. If the user chooses the latter option, there is a mechanism in place to prevent the creation of abnormally long values, as explained below.

For example, consider this template:

TEMPLATE(PERSONAL_DATA)
{
    @Name (S),
    @Address,
    @Phone
}

With these rules:

SCOPE SENTENCE
{
    IDENTIFY(PERSONAL_DATA)
    {
        @Name[TYPE(NPH) - TYPE(PRO)]
        <-99:99>
        @Address[TYPE(ADR)]
    }

    IDENTIFY(PERSONAL_DATA)
    {
        @Name[TYPE(NPH) - TYPE(PRO)]
        <-99:99>
        @Address[KEYWORD("house in")]|[TEXT #1]
        >
        @Address[TYPE(ADR)]|[TEXT #2]
    }

    IDENTIFY(PERSONAL_DATA)
    {
        @Address[TYPE(ADR)]
        <:99>
        @Phone[TYPE(PHO)]
    }
}

applied to this text:

Steven lives in Baltimora Street and his phone number is 3333333333. His friend Mary bought a house in Tropicana Street where she lives with her husband Steven and his phone number is 3311133331.

you will get these records:

Template: PERSONAL_DATA

Field	Value
@Name	Steven
@Address	Baltimora Street

Template: PERSONAL_DATA

Field	Value
@Name	Mary
@Address	Tropicana Street

Template: PERSONAL_DATA

Field	Value
@Name	Steven
@Address	Tropicana Street

Template: PERSONAL_DATA

Field	Value
@Name	Steven
@Address	house in Tropicana Street

Template: PERSONAL_DATA

Field	Value
@Name	Mary
@Address	house in Tropicana Street

Template: PERSONAL_DATA

Field	Value
@Phone	3333333333
@Address	Baltimora Street

Template: PERSONAL_DATA

Field	Value
@Phone	3311133331
@Address	Tropicana Street

With this code:

function onFinalize(result) {
    mergepost.MERGE_BY_OVERLAP(result, {
        templateName: "PERSONAL_DATA",
        aggregatorFields: "Address",
        normalizationType: "extension",
        maxExtensionLength: -1
    })
    return result;
}

you will get these records:

Template: PERSONAL_DATA

Field	Value
@Phone	3311133331
@Name	Steven
@Name	Mary
@Address	Tropicana Street

Template: PERSONAL_DATA

Field	Value
@Phone	3333333333
@Name	Steven
@Address	Baltimora Street

The syntax is:

moduleVariable.MERGE_BY_OVERLAP(result, arguments)

where:

moduleVariable is the variable corresponding to the module and set with require().
result is the object containing the analysis results.
arguments is an object containing the parameters to be used. Such parameters are:
- templateName is the template name of the records to merge.
- aggregatorFields is an array containing the field names around which the information is gathered. In case of a single field, you can also use a string instead of an array.
- inhibitorFields is an optional array containing the names of fields that, if contained in a record, prevent it from being merged. In case of a single field, you can also use a string instead of an array. If undeclared, no inhibitors will be checked.
- normalizationType is an optional parameter that can be:
  - either longest or extension (case insensitive)
  - if undeclared, it defaults to longest
  - if set to longest, the longest extracted value among the merged pivots is used for normalization.
  - if set to extension, a new extracted value is created by taking the minimum begin and maximum end values of the merged extractions. This new value will be a substring of the original text, ignoring any previous normalization (if any was applied).
- maxExtensionLength is an optional numerical parameter that can be used to set a maximum length for the normalized value if normalizationType is set to extension:
  - if the parameter is not declared, the default value of 100 will be used.
  - if the normalized value exceeds the threshold, it will be discarded and the longest extracted value will be used instead. A warning will be printed in the console.
  - if the parameter is set to -1, any length will be accepted. This option is suggested when other types of merging operations are used before invoking the MERGE_BY_OVERLAP method, as it helps avoid unexpectedly long values.

The method also supports a parametric syntax, that is:

moduleVariable.MERGE_BY_OVERLAP(result, templateName, aggregatorFields[, inhibitorFields, normalizationType, maxExtensionLength])
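The overlap check and the two normalization types can be sketched as follows in plain JavaScript. This is a minimal illustration of the rules listed above, not the module's implementation: the extraction layout (value, begin, end), the half-open offsets and the helper names are all assumptions.

```javascript
const text = "Mary bought a house in Tropicana Street.";

// Builds a hypothetical extraction with offsets located in the text.
function extraction(value) {
    const begin = text.indexOf(value);
    return { value, begin, end: begin + value.length };
}

// True when two extractions share at least one character position.
// Offsets are assumed to be half-open: [begin, end).
function overlaps(a, b) {
    return a.begin < b.end && b.begin < a.end;
}

// Returns the longest value among the merged extractions.
function longestValue(extractions) {
    return extractions.reduce((best, e) => (e.value.length > best.value.length ? e : best)).value;
}

// Illustrative helper (not a mergepost API) applying the documented defaults:
// "longest" normalization and a maximum extension length of 100.
function normalize(text, extractions, normalizationType = "longest", maxExtensionLength = 100) {
    if (normalizationType.toLowerCase() !== "extension") {
        return longestValue(extractions);
    }
    // Extension: substring from the minimum begin to the maximum end.
    const begin = Math.min(...extractions.map((e) => e.begin));
    const end = Math.max(...extractions.map((e) => e.end));
    const extended = text.slice(begin, end);
    if (maxExtensionLength !== -1 && extended.length > maxExtensionLength) {
        console.warn("Extended value too long, using the longest extraction instead");
        return longestValue(extractions);
    }
    return extended;
}

const partialA = extraction("house in Tropicana");
const partialB = extraction("Tropicana Street");

console.log(overlaps(partialA, partialB));
console.log(normalize(text, [partialA, partialB]));
console.log(normalize(text, [partialA, partialB], "extension", -1));
```

With a small threshold, for example normalize(text, [partialA, partialB], "extension", 10), the sketch falls back to the longest extraction and prints a warning.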

REPLACE_FIELD_VALUE

The REPLACE_FIELD_VALUE method changes field values.
It must be used in the onFinalize function when extraction results are available.

Consider the same example used to describe the FIELD_CLONE method, which creates a clone of an existing field. Then this code:

function onFinalize(result) {
    mergepost.FIELD_CLONE(result, {
        templateName: "PERSONAL_DATA",
        fieldName: "NAME",
        clonedFieldName: "NAME_INITIALS"
    })
    mergepost.REPLACE_FIELD_VALUE(result, {
        templateName: "PERSONAL_DATA",
        fieldName: "NAME_INITIALS",
        eraseNotFound: false,
        replaceRule: {"Sherlock Holmes": "S.H.", "John Watson": "J.W."}
    })
    return result;
}

creates the clone field and changes its value. The output becomes:

Template: PERSONAL_DATA

Field	Value
@NAME_INITIALS	S.H.
@NAME	Sherlock Holmes

The syntax is:

moduleVariable.REPLACE_FIELD_VALUE(result, arguments)

where:

moduleVariable is the variable corresponding to the module and set with require().
result is the object containing the analysis results.
arguments is an object containing the parameters to be used. Such parameters are:
- templateName is the template name of the records to act upon.
- fieldName is the field name.
- eraseNotFound is a boolean. It can be:
  - true: all the fields not matching the replacement rules (see below) are deleted from their respective records. If removing fields leaves a record empty, the record is deleted too.
  - false: all the fields not matching the replacement rules are left untouched.
- replaceRule is an object representing replacement rules. Each property name is interpreted as the value to replace, while the property value is the replacement value.

The method also supports a parametric syntax, that is:

moduleVariable.REPLACE_FIELD_VALUE(result, templateName, fieldName, eraseNotFound, replaceRule)
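The interaction between replaceRule and eraseNotFound can be sketched in plain JavaScript. The record layout and the replaceFieldValue helper are hypothetical; the sketch also assumes that eraseNotFound only affects occurrences of the field named by fieldName.

```javascript
// Hypothetical, simplified records; not the real analysis results object.
const records = [
    { template: "PERSONAL_DATA", fields: [
        { name: "NAME_INITIALS", value: "Sherlock Holmes" },
        { name: "NAME", value: "Sherlock Holmes" }
    ] },
    { template: "PERSONAL_DATA", fields: [
        { name: "NAME_INITIALS", value: "Mycroft Holmes" }
    ] }
];

// Illustrative helper (not a mergepost API): returns new records in which the
// values of `fieldName` are replaced according to `replaceRule`.
function replaceFieldValue(records, templateName, fieldName, eraseNotFound, replaceRule) {
    const output = [];
    for (const record of records) {
        if (record.template !== templateName) { output.push(record); continue; }
        const fields = [];
        for (const field of record.fields) {
            if (field.name !== fieldName) {
                fields.push(field);
            } else if (Object.prototype.hasOwnProperty.call(replaceRule, field.value)) {
                // Property name = value to replace, property value = replacement.
                fields.push({ ...field, value: replaceRule[field.value] });
            } else if (!eraseNotFound) {
                fields.push(field);
            }
        }
        // A record left without fields is dropped.
        if (fields.length > 0) output.push({ ...record, fields });
    }
    return output;
}

const rules = { "Sherlock Holmes": "S.H.", "John Watson": "J.W." };
const kept = replaceFieldValue(records, "PERSONAL_DATA", "NAME_INITIALS", false, rules);
const erased = replaceFieldValue(records, "PERSONAL_DATA", "NAME_INITIALS", true, rules);
console.log(kept.length, erased.length);
```

With eraseNotFound set to false the second record keeps its unmatched value, while with true its only field is removed and the record disappears.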

load

The load method prepares one or more of the operations that the methods above perform, taking them from a configuration file generated when importing a project created with a legacy edition of Studio. The prepared operations are then applied using the apply method.

Warning

The load method is needed only in the case described above, and the import procedure already generates the appropriate statements inside the main.jr file, so there are basically no cases in which you have to write code that uses this method yourself.

For example, when importing an old project, Studio may generate this code:

var mergepost = require("modules/mergepost");

function initialize(cmdline) {
    if (!mergepost.load('Config.xml')) {
        CONSOLE.error(mergepost.getLastError());
        return false;
    }
    return true;
}

function onFinalize(result) {
    result = mergepost.apply(result);
    return result;
}

The syntax is:

moduleVariable.load(configPath)

where:

moduleVariable is the variable corresponding to the module and set with require().
configPath is the path of the configuration file generated by the import procedure.

The method returns true in case of success, false otherwise. In case of failure it sets an error message you can retrieve with the getLastError method.

apply

The apply method performs all the operations prepared with the invocation of the load method.
It must be used in the onFinalize function when extraction results are available.

For example:

function onFinalize(result) {
    result = mergepost.apply(result);
    return result;
}

The syntax is:

moduleVariable.apply(result)

where:

moduleVariable is the variable corresponding to the module and set with require().
result is the object containing the analysis results.

getLastError

The getLastError method retrieves the message corresponding to the last error that occurred when the load method fails. Use it to display the error message.

For example:

function initialize(cmdline) {
    if (!mergepost.load('Config.xml')) {
        // Report why the configuration could not be loaded.
        CONSOLE.error(mergepost.getLastError());
        return false;
    }
    return true;
}

The syntax is:

moduleVariable.getLastError()

where moduleVariable is the variable corresponding to the module and set with require().