Transformation of structured entities

Overview

In extraction rule writing, a transformer is an extraction syntax that lets the user define how certain types of structured entities are divided into their constituent elements. These elements are then transferred into different fields of a template. The disambiguator identifies such structured entities during the disambiguation process as predictable combinations of letters, numbers and symbols whose overall semantic value is stronger than the semantic value of the components considered individually.

For example, the disambiguator is able to recognize that a string composed of a number and the proper name of a street is an address. A date can be recognized as such because it consists of a particular and predictable combination of numbers (for example, 02, 2009, '70), words (for example, Jul, July) and other non-alphanumeric characters (for example, /, -). This part of the disambiguation process generates entity types such as NPH (person's name), ADR (address) and DAT (date). Entities of these types can either be treated as a whole (a complete address, a complete date, etc.) or be broken down into their constituent elements.

The complete transformer syntax is the following:

TEMPLATE(templateName)
{
    @field_1,
    @field_2,
    ...
    @field_n
}

DEFINE transformerName = TRANSFORM(ID) IN TEMPLATE(templateName)
{
    FIELD(ID) IN field_1
    FIELD(ID) IN field_2
    ...
}

SCOPE scopeOption
{
    IDENTIFY(templateName)
    {
        @@transformerName[attribute]
    }
}

where:

TEMPLATE, DEFINE, TRANSFORM, IN, FIELD, SCOPE, IDENTIFY are language keywords and must be written in uppercase as shown above.
ID refers to the unique (unambiguous) identifier that belongs to each and every syncon found in the Knowledge Graph. It is always a whole number made up of one or several digits.
templateName, field_# and transformerName refer, respectively, to the name to be assigned to the template, the extraction fields and the transformer.

Warning

The transformer name cannot start with a number or consist only of numbers.

scopeOption refers to the part of a rule syntax that allows you to specify the portion of text that the rule must act upon.
attribute can be one of the attributes available in the Rules language.

Warning

The syntax will accept any of the available attributes, but to make the transformer as effective as possible, it is important to carefully choose the attribute and the related value so that all of the target entities are matched in the input text.

This syntax can be split into three main parts:

The template definition.
The transformer in the strict sense.
The extraction rule(s).

The TEMPLATE contains the information found in the entity's constituent elements, organized into template fields.

The transformer block consists of the DEFINE line (or header) and the transformer body. The DEFINE line associates the metadata records with the template: it gives the transformer a name (transformerName) and defines the action to be performed. That action can be described as follows: transpose the extracted information associated with the syncon ID to the previously defined template. The transformer body associates each entity record with the template fields. Each field will contain the extracted information associated with the syncon ID corresponding to one of the entity's constituent elements.

The final elements which complete the syntax are the rule or rules extracting the values that must be transformed. The rule identifies the previously created template, then uses one or more attributes to extract data from a text. Unlike standard extraction rules, this particular rule does not extract a value directly to a field (@fieldname), but uses the name of the transformer preceded by a double at sign (@@). The transformer is then responsible for splitting the data and associating it with the different template fields.

Consider the management of people's names, one of the structured types of entities identified by the disambiguator. By default, people's names are analyzed as virtual children of syncon 78452 (person). The names can usually be split into first name and last name. They also implicitly contain information about the gender of the person, since it is often possible to recognize whether a first name is masculine or feminine.

The following paragraph details how to configure the transformer in order to reach the desired result.

TEMPLATE(PERSON)
{
    @first_name,
    @family_name,
    @gender 
}

The first step is to define a template, in this case PERSON, which holds the constituent elements into which people's names are broken down. These elements are organized in the template fields called first_name, family_name and gender.

Next, the DEFINE line establishes where the following actions are performed:

DEFINE Person = TRANSFORM (78452) IN TEMPLATE (PERSON)

The transformer is given a name (Person) and the action to be performed is expressed. The action can be described as follows: transpose the information contained in the extracted values associated to syncon 78452 (which is the supernomen, the virtual parent, for any person's name) to the template called PERSON.

Finally, the transformer body associates each entity record with the template fields:

{
    FIELD(29307) IN first_name
    FIELD(29305) IN family_name
    FIELD(24254) IN gender
}

The template field first_name contains the extracted information associated with syncon 29307 (the concept of first name in the Knowledge Graph), the field family_name contains the information associated with syncon 29305 (the concept of last name in the Knowledge Graph) and the field gender contains the information associated with syncon 24254 (the concept of gender in the Knowledge Graph).

The last element that completes the syntax is the creation of one or more rules extracting the people's names that must be transformed:

SCOPE SENTENCE
{
    IDENTIFY(PERSON)
    {
        @@Person[TYPE(NPH)]
    }
}

Now consider the following example of template, transformer and extraction rule:

TEMPLATE(PERSON)
{
    @first_name,
    @family_name,
    @gender
}

DEFINE Person = TRANSFORM (78452) IN TEMPLATE (PERSON)
{
    FIELD(29307) IN first_name
    FIELD(29305) IN family_name
    FIELD(24254) IN gender
}

SCOPE SENTENCE
{
    IDENTIFY(PERSON)
    {
        @@Person[TYPE(NPH)]
    }
}

along with the sample sentence below:

To stand your ground in the face of relentless criticism from a double Nobel prize-winning scientist takes a lot of guts. For engineer and materials scientist Dan Shechtman, however, years of self-belief in the face of the eminent Linus Pauling's criticisms led him to the ultimate accolade: his own Nobel prize.
Shechtman was the sole winner of the Nobel prize for chemistry in 2011, for his discovery of seemingly impossible crystal structures in metal alloys.

The extraction rule identifies the people's names Linus Pauling and Dan Shechtman and the transformer breaks them down into their components, transposing their content to the fields in the predefined template. The final result is:

Template: PERSON

@first_name	@family_name	@gender
Linus	Pauling	M

Template: PERSON

@first_name	@family_name	@gender
Dan	Shechtman	M
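The splitting step can be pictured with a small Python model. This is only a conceptual sketch, not the Rules language or the engine's actual implementation: the Entity class, the transform function and the way components are keyed by syncon ID are all hypothetical, and only the syncon IDs and field names come from the example above.

```python
from dataclasses import dataclass

# Hypothetical model of a disambiguated entity: the syncon ID of its virtual
# parent plus its component values keyed by the syncon ID of each element.
@dataclass
class Entity:
    parent_id: int
    components: dict

def transform(entity, parent_id, field_map):
    """Mimic TRANSFORM(parent_id): copy each component into its template field.

    Returns None when the entity is not a child of parent_id.
    """
    if entity.parent_id != parent_id:
        return None
    return {
        field: entity.components.get(syncon_id)
        for syncon_id, field in field_map.items()
    }

# FIELD(ID) IN field lines from the Person transformer, as an ID -> field map.
PERSON_FIELDS = {29307: "first_name", 29305: "family_name", 24254: "gender"}

pauling = Entity(78452, {29307: "Linus", 29305: "Pauling", 24254: "M"})
record = transform(pauling, 78452, PERSON_FIELDS)
print(record)
```

Each matched entity yields its own record, which corresponds to one PERSON template per name in the output above.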

Use of constants

Variations to the standard transformer syntax may be applied if necessary. For example, the content of a field may be manipulated by adding a constant value to the varying value extracted from a text. The syntax is the following:

TEMPLATE(templateName)
{
    @field_1,
    @field_2,
    ...
    @field_n
}

DEFINE transformerName = TRANSFORM(ID) IN TEMPLATE(templateName)
{
    constant + FIELD(ID) IN field_1
    FIELD(ID) + constant IN field_2
    ...
    FIELD(ID) IN field_n
}

SCOPE scopeOption
{
    IDENTIFY(templateName)
    {
        @@transformerName[attribute]
    }
}

where constant refers to an invariable value that will always be added to the extraction output. The constant must be typed in between quotation marks ("...").

For example, if it is required to add name: before a person's first name, the sample transformer defined above should be modified as follows:

DEFINE Person = TRANSFORM (78452) IN TEMPLATE (PERSON)
{
    "name: " + FIELD(29307) IN first_name
    FIELD(29305) IN family_name
    FIELD(24254) IN gender
}

Note

Any punctuation mark or spaces which need to appear in the final output must be defined in the constant value.

If the rule above was applied to the sample text, the results would be:

Template: PERSON

@first_name	@family_name	@gender
name: Linus	Pauling	M

Template: PERSON

@first_name	@family_name	@gender
name: Dan	Shechtman	M

The plus sign (+) can also be used to combine different tokens within the same field.
Consider this syntax:

TEMPLATE(templateName)
{
    @field_1,
    @field_2,
    ...
    @field_n
}

DEFINE transformerName = TRANSFORM(ID) IN TEMPLATE(templateName)
{
    constant + FIELD(ID) + FIELD(ID) IN field_1
    FIELD(ID) IN field_2
    ...
    FIELD(ID) IN field_n
}

SCOPE scopeOption
{
    IDENTIFY(templateName)
    {
        @@transformerName[attribute]
    }
}

This use of the plus sign can be adopted for the example rule like this:

TEMPLATE(PERSON)
{
    @full_name,
    @gender
}

DEFINE Person = TRANSFORM (78452) IN TEMPLATE (PERSON)
{
    "full name: " + FIELD(29305) + ", " + FIELD(29307) IN full_name
    FIELD(24254) IN gender
}

SCOPE SENTENCE
{
    IDENTIFY(PERSON)
    {
        @@Person[TYPE(NPH)]
    }
}

In this case, the template has two fields instead of three. The transformer puts the constant full name: before the first token to be extracted, which is the family name (syncon 29305). After the family name, a second constant (a comma followed by a space) is inserted. The final token, which closes the content of the full_name field, is the first name (syncon 29307).

If the rule above is applied to the sample text, the final output will be a normalized extraction record containing the full name of the person (ordered by family name) along with information about the gender.

Template: PERSON

@full_name	@gender
full name: Pauling, Linus	M

Template: PERSON

@full_name	@gender
full name: Shechtman, Dan	M
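The way the plus sign assembles a field from constants and tokens can be sketched in Python. This is a hypothetical model for illustration only: build_field and its list-of-parts representation are not part of the Rules language, and the sketch assumes every referenced component is present (it does not model what the engine does when one is missing).

```python
def build_field(parts, components):
    """Concatenate constants (str) and component values looked up by syncon ID (int)."""
    pieces = []
    for part in parts:
        if isinstance(part, int):
            # FIELD(ID): take the component value for this syncon ID.
            pieces.append(components[part])
        else:
            # Quoted constant: copied verbatim, including spaces and punctuation.
            pieces.append(part)
    return "".join(pieces)

# Components of the entity "Linus Pauling", keyed by syncon ID.
components = {29307: "Linus", 29305: "Pauling", 24254: "M"}

# "full name: " + FIELD(29305) + ", " + FIELD(29307) IN full_name
full_name = build_field(["full name: ", 29305, ", ", 29307], components)
print(full_name)  # full name: Pauling, Linus
```

Because nothing is inserted between the parts automatically, the space after the colon and the comma separator appear only because they are written inside the constants.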

In other cases, the transformer may not be able to find data for every field of a template. When no token is matched for a given field, that field cannot be filled. In these cases, the field can be given a default string (for example ***) which serves as a placeholder for the missing value.

The syntax is:

TEMPLATE(templateName)
{
    @field_1,
    @field_2,
    ...
    @field_n
}

DEFINE transformerName = TRANSFORM(ID) IN TEMPLATE(templateName)
{
    FIELD(ID:constant) IN field_1
    FIELD(ID) IN field_2
    ...
    FIELD(ID) IN field_n
}

SCOPE scopeOption
{
    IDENTIFY(templateName)
    {
        @@transformerName[attribute]
    }
}

Here constant, written between quotation marks, is the default value assigned to the field when no value for FIELD(ID) can be found.

Consider the following example:

DEFINE Person = TRANSFORM (78452) IN TEMPLATE (PERSON)
{
    FIELD(29307:"---") IN first_name
    FIELD(29305:"*-*-") IN family_name
    FIELD(24254) IN gender
}

The field first_name will be set to --- if no first name is found in the text, and the field family_name will be set to *-*- if no value is recognized as a family name. With this configuration, each field contains either the extracted value, when one is present, or the predefined string, when no value is available.

Consider a variation of the previous sample text in which the first name Dan is omitted:

To stand your ground in the face of relentless criticism from a double Nobel prize-winning scientist takes a lot of guts. For engineer and materials scientist Shechtman, however, years of self-belief in the face of the eminent Linus Pauling's criticisms led him to the ultimate accolade: his own Nobel prize.
Shechtman was the sole winner of the Nobel prize for chemistry in 2011, for his discovery of seemingly impossible crystal structures in metal alloys.

With the new transformer, the extraction becomes:

Template: PERSON

@first_name	@family_name	@gender
Linus	Pauling	M

Template: PERSON

@first_name	@family_name	@gender
---	Shechtman	M

Shechtman has been recognized as a person's name even though the first name is missing. The entity has been extracted and processed by the transformer to fill in the fields of the PERSON template, and the field @first_name has been set to the default string.
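The fallback behavior of FIELD(ID:constant) can be modeled with a short Python sketch. The fill_fields function and the (syncon ID, default) pairs are hypothetical names chosen for illustration, not part of the Rules language. The sketch also assumes that a field without a default is simply left empty (None here) when nothing is found.

```python
def fill_fields(field_specs, components):
    """Fill each template field with its component value or its default string."""
    record = {}
    for field, (syncon_id, default) in field_specs.items():
        # Use the extracted value when present, otherwise the default.
        record[field] = components.get(syncon_id, default)
    return record

# FIELD(29307:"---") IN first_name, FIELD(29305:"*-*-") IN family_name,
# FIELD(24254) IN gender (no default declared).
PERSON_SPECS = {
    "first_name": (29307, "---"),
    "family_name": (29305, "*-*-"),
    "gender": (24254, None),
}

# "Shechtman" without a first name: only family name and gender are available.
shechtman = {29305: "Shechtman", 24254: "M"}
record = fill_fields(PERSON_SPECS, shechtman)
print(record)
```

The default is used only for the missing first name. The family name keeps its extracted value even though a default is declared for it.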