SCRIPT attribute
Overview
The SCRIPT attribute allows the user to integrate scripting functions into categorization and extraction rules.
A function is invoked in the attribute, and the attribute's value (true or false) is the function's return value.
By using SCRIPT in combination with the other standard attributes, it is possible to perform powerful, project-specific reasoning.
The syntax for the SCRIPT attribute is:
SCRIPT("functionName")
In order to be referenced, a scripting function must be defined inside the main.jr file:
function functionName(token_index, param) {
    // place code here
    return true;
}
where:
token_index is the token ordinal number in the disambiguation output.
param is an optional parameter that can be passed by the attribute.
The SCRIPT attribute expects the function to return either true or false.
Wrapped calls to separate JR modules are also supported, for example:
function functionName(token_index, param) {
    return myScript.functionName(token_index, param);
}
This makes the functionName function exported from the myScript module usable within a rule.
An alternative syntax for the SCRIPT attribute is:
SCRIPT("functionName[:parameter]"[, "function2Name[:parameter]", ...])
where parameter is the argument passed to the function and corresponds to the value of its param parameter.
When several functions are listed, the separating comma acts as an OR operator: the SCRIPT attribute is satisfied if at least one of the functions returns true.
Info
If an exception occurs during the execution of the specified functions, the engine does not stop, but the SCRIPT attribute is evaluated as false.
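The behavior described so far (optional parameter after the colon, comma as OR, exception evaluated as false) can be modeled in plain JavaScript. This is only an illustrative sketch and not the engine's implementation: evaluateScript, registry and the three demo functions are hypothetical names, and splitting the string at the first colon is an assumption. The text does not say whether the remaining functions are still tried after an exception, so the sketch takes it literally and makes the whole attribute false.

```javascript
// Hypothetical model of how SCRIPT("f1:param", "f2") could be evaluated.
// It only mirrors the documented behavior; it is not the engine's code.
function evaluateScript(specs, tokenIndex, registry) {
    try {
        for (var i = 0; i < specs.length; i++) {
            // Assumption: everything after the first colon is the parameter.
            var sep = specs[i].indexOf(":");
            var name = sep === -1 ? specs[i] : specs[i].slice(0, sep);
            var param = sep === -1 ? undefined : specs[i].slice(sep + 1);
            // Comma-separated functions act as OR: the first true wins.
            if (registry[name](tokenIndex, param) === true) {
                return true;
            }
        }
    } catch (e) {
        // An exception does not stop anything; the attribute is just false.
        return false;
    }
    return false;
}

// Hypothetical scripting functions, defined only for this demonstration.
var registry = {
    indexIsEven: function (token_index) { return token_index % 2 === 0; },
    indexEquals: function (token_index, param) { return token_index === Number(param); },
    alwaysThrows: function () { throw new Error("script failure"); }
};

var orResult = evaluateScript(["indexIsEven", "indexEquals:3"], 3, registry);
var noMatch = evaluateScript(["indexIsEven"], 3, registry);
var onException = evaluateScript(["alwaysThrows"], 3, registry);
console.log(orResult, noMatch, onException);
```

In this sketch the parameter always reaches the function as a string, which is why indexEquals converts it with Number before comparing.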
Because it can navigate the whole disambiguation output, the function can add very specific constraints to the attribute to which it relates.
Consider the following extraction rule:
SCOPE SENTENCE
{
IDENTIFY(PEOPLE)
{
@NAME[TYPE(NPH) + SCRIPT("excludingLemma:John Smith")]
}
}
It aims to extract all people's names except John Smith.
The following function blocks the extraction of any token whose LEMMA matches the value passed as parameter.
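function excludingLemma(token_index, lemma) {
    // Get the token at this position in the disambiguation output
    var token = DIS.getToken(token_index);
    // Reject the token when its base form is the excluded lemma
    return token.lemma !== lemma;
}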
By leveraging the disambiguation output and being able to perform a check on the base forms, this function is able to avoid the extraction of John Smith even when an abbreviated form of the name occurs in the text.
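To try this logic outside the engine, the DIS object can be replaced by a small stand-in. The sketch below is hypothetical: the mock DIS, its tokens array, the text field and the sample lemmas are assumptions made for illustration, and only getToken and the lemma property come from the function above.

```javascript
// Hypothetical stand-in for the engine's DIS object, for local testing only.
// Token 0 models an abbreviated mention whose base form is assumed to be
// "John Smith"; token 1 is an unrelated person.
var DIS = {
    tokens: [
        { text: "J. Smith", lemma: "John Smith" },
        { text: "Mary Jones", lemma: "Mary Jones" }
    ],
    getToken: function (index) { return this.tokens[index]; }
};

// Same logic as the excludingLemma function above.
function excludingLemma(token_index, lemma) {
    var token = DIS.getToken(token_index);
    if (token.lemma == lemma)
        return false;
    return true;
}

// The abbreviated mention is rejected because the check is on the lemma,
// not on the surface text.
var abbreviated = excludingLemma(0, "John Smith");
var other = excludingLemma(1, "John Smith");
console.log(abbreviated, other);
```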
Using SCRIPT alone and in combination
You can use the SCRIPT attribute either alone or in combination with other attributes. Using it in combination is recommended. For example, consider this user-defined function:
function textIsTitleCase(index) {
    // constant regex
    var title_case = /^([A-ZÀ-Ÿ][a-zà-ÿ]*)\b/;
    // Get text from token
    var text = DIS.getTokenText(index);
    return title_case.test(text);
}
This function returns true only for tokens whose text is in title case, so only those tokens can be extracted.
For example, with this template:
TEMPLATE(WORD_CASE)
{
@TITLE_CASE
}
this extraction rule:
SCOPE SENTENCE
{
IDENTIFY(WORD_CASE)
{
@TITLE_CASE[SCRIPT("textIsTitleCase")]
}
}
applied to this text:
John lives in London. His brother lives in Manchester.
you will get these records:
Template: WORD_CASE
Field | Value |
---|---|
@TITLE_CASE | John |
Template: WORD_CASE
Field | Value |
---|---|
@TITLE_CASE | London |
Template: WORD_CASE
Field | Value |
---|---|
@TITLE_CASE | His |
Template: WORD_CASE
Field | Value |
---|---|
@TITLE_CASE | Manchester |
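The four values above can be reproduced outside the engine by running the same regular expression over the words of the sample text. This is a minimal sketch under stated assumptions: the mock DIS object and its texts array are hypothetical, and the tokenization (one token per word or punctuation mark) is assumed rather than taken from the real disambiguation output.

```javascript
// Hypothetical stand-in for DIS; assumes one token per word or punctuation mark.
var DIS = {
    texts: ["John", "lives", "in", "London", ".",
            "His", "brother", "lives", "in", "Manchester", "."],
    getTokenText: function (index) { return this.texts[index]; }
};

// Same function as in the text above.
function textIsTitleCase(index) {
    var title_case = /^([A-ZÀ-Ÿ][a-zà-ÿ]*)\b/;
    var text = DIS.getTokenText(index);
    return title_case.test(text);
}

// Collect the text of every token accepted by the script.
var accepted = [];
for (var i = 0; i < DIS.texts.length; i++) {
    if (textIsTitleCase(i)) {
        accepted.push(DIS.texts[i]);
    }
}
console.log(accepted.join(", ")); // John, London, His, Manchester
```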
With this rule:
SCOPE SENTENCE
{
IDENTIFY(WORD_CASE)
{
@TITLE_CASE[TYPE(ADJ:p) + SCRIPT("textIsTitleCase")]
}
}
you will get this record:
Template: WORD_CASE
Field | Value |
---|---|
@TITLE_CASE | His |
In the first rule the script analyzes every token, while in the second rule it acts only on the tokens already selected by the TYPE attribute.
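The effect of the filter can be illustrated by counting how often the script runs. The sketch below is hypothetical throughout: the tokens array, the "OTHER" placeholder tag, the matchRule helper and the assumption that the type test runs before the script are not taken from the engine, and the regular expression is simplified to ASCII letters.

```javascript
// Hypothetical tokens. Only "His" carries the ADJ:p type used in the rule;
// "OTHER" is a placeholder, not a real type tag.
var tokens = [
    { text: "John", type: "OTHER" }, { text: "lives", type: "OTHER" },
    { text: "in", type: "OTHER" }, { text: "London", type: "OTHER" },
    { text: "His", type: "ADJ:p" }, { text: "brother", type: "OTHER" },
    { text: "lives", type: "OTHER" }, { text: "in", type: "OTHER" },
    { text: "Manchester", type: "OTHER" }
];

var scriptCalls = 0;

// Simplified title-case check that also counts its own invocations.
function textIsTitleCase(text) {
    scriptCalls++;
    return /^[A-Z][a-z]*\b/.test(text);
}

// Model of "TYPE(...) + SCRIPT(...)": assumes the type test runs first,
// so the script only sees tokens of the required type (null = no filter).
function matchRule(requiredType) {
    scriptCalls = 0;
    var values = [];
    for (var i = 0; i < tokens.length; i++) {
        if (requiredType !== null && tokens[i].type !== requiredType) {
            continue;
        }
        if (textIsTitleCase(tokens[i].text)) {
            values.push(tokens[i].text);
        }
    }
    return { values: values, scriptCalls: scriptCalls };
}

var scriptOnly = matchRule(null);
var withType = matchRule("ADJ:p");
console.log(scriptOnly.scriptCalls, scriptOnly.values.join(", "));
console.log(withType.scriptCalls, withType.values.join(", "));
```

Under these assumptions the script runs once per token when used alone, and only once in total when TYPE narrows the candidates first.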