TRANSFORM TOKEN

In extraction rule writing, TRANSFORM TOKEN is a syntax comparable to the simple TRANSFORM syntax. Both syntaxes associate extraction data with a template, but whereas TRANSFORM only performs a transformation of structured entities, TRANSFORM TOKEN can transform any kind of entity or concept. In particular, with the TRANSFORM TOKEN syntax, a value can be assigned to several template fields by applying different transformation methods (options) to an extracted token.

The complete TRANSFORM TOKEN syntax is the following:

TEMPLATE(templateName)
{
    @field_1,
    @field_2,
    ...
    @field_n
}

DEFINE transformerName = TRANSFORM(TOKEN) IN TEMPLATE(templateName)
{
    constant IN field_1
    TRANSFORMATION (transformationOption) IN field_2
}

SCOPE scopeOption
{
    IDENTIFY(templateName)
    {
        @@transformerName[attribute]
    }
}

where:

  • TEMPLATE, DEFINE, TRANSFORM, TOKEN, IN, TRANSFORMATION, SCOPE and IDENTIFY are language keywords which must be written in uppercase as shown above.
  • constant refers to an invariable value that will always be added to the extraction output.
  • transformationOption refers to any of the transformation options available.
  • templateName, field_# and transformerName refer respectively to the name that the user assigns to the template, the extraction fields and the transformer.
  • scopeOption refers to the part of a rule syntax which specifies the portion of the text to be considered when evaluating the rule.
  • attribute refers to any of the attributes available in the Rules language.

This syntax can be divided into three main parts:

  • The template definition.
  • The transformer in the strict sense.
  • The extraction rule(s).

The template lists the names of the fields that will receive the transformed values. The transformer block consists of the DEFINE line and a body containing the constants and transformation options. The DEFINE line associates the transformer with the template, gives the transformer a name (transformerName), and expresses the action to be performed.

The DEFINE line can be read as follows: transpose any extracted information (TOKEN) to the previously defined template. The transformer body then associates each extracted entity with the template fields. Each field can contain constant values, normalized extraction tokens, or both.

The last element of the syntax is the rule, or set of rules, that extracts the values to be transformed. The rule identifies the previously created template and uses one or more attributes to extract data from the text. Unlike standard extraction rules, this rule does not extract a value directly to a field (@fieldname[]). Instead, it uses the name of the transformer preceded by a double at sign (@@); the transformer is then responsible for sorting the data and associating it with the different template fields.
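As a rough mental model of how the three parts fit together, the sketch below mimics the data flow in Python. It is not Rules-language code and does not reflect how the engine is implemented: the Transformer class, its apply method and the identity function are hypothetical names introduced only for illustration. A field is modeled as a list of parts, where each part is either a constant string or a function applied to the extracted token.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

# A part is either a constant string or a function applied to the token
# (a stand-in for TRANSFORMATION(transformationOption)).
Part = Union[str, Callable[[str], str]]


@dataclass
class Transformer:
    """Hypothetical model of a DEFINE ... = TRANSFORM(TOKEN) block."""
    template: str
    fields: Dict[str, List[Part]]  # field name -> parts to concatenate

    def apply(self, token: str) -> Dict[str, str]:
        """Build one template record from a single extracted token."""
        return {
            name: "".join(
                part if isinstance(part, str) else part(token)
                for part in parts
            )
            for name, parts in self.fields.items()
        }


def identity(token: str) -> str:
    """Placeholder transformation that returns the token unchanged."""
    return token


# Mirrors the generic syntax: a constant in field_1, a transformation in field_2.
transformer = Transformer(
    template="templateName",
    fields={"field_1": ["constant"], "field_2": [identity]},
)

# The extraction rule plays the role of the caller: it hands the matched
# token to the transformer, which fills every field of the record.
record = transformer.apply("extracted token")
print(record)
```

The point of the model is that the rule supplies a single token, while the transformer decides how that token (and any constants) are distributed across the template fields.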

For example, consider a project in which the proper names of infrastructures (airports, harbors, highways and train stations) must be extracted from the text. A standard extraction project would require at least one field for each type of infrastructure, where the field names indicate the type and the proper names of the entities are the values extracted from the text. With the TRANSFORM TOKEN syntax, however, you can:

  • Define just one field for any type of infrastructure and one field for any entity proper name (efficiency).
  • Define just the macro-types of entities to be extracted, thus having the option to easily reduce or expand the project's scope (scalability).
  • Transform the proper name (token) in a variety of ways and combine different types of token transformation (flexibility).

The following paragraphs describe how the transformer should be configured to reach the desired result.

The first step to configure a TRANSFORM TOKEN option is to create the template that will receive the extracted values. In this case, it is called INFRASTRUCTURES and it contains two fields:

  • The type of infrastructure extracted (@type).
  • The proper name of the extracted infrastructure (@name).

The template is:

TEMPLATE(INFRASTRUCTURES)
{
    @type,
    @name
}

The next step is to define two transformers, each of which manages a different type of infrastructure. The first transformer is called Airport, and its purpose is to transfer the information extracted by the rule that invokes it into the INFRASTRUCTURES template. The body of the transformer states that the template field @type must contain the word airport as a constant value every time the Airport transformer extracts a value from the text, while the template field @name will contain the proper name of the infrastructure found in the text, transformed using the SMARTENTRY transformation option. The second transformer, called Railway, does the same with the constant railway.

DEFINE Airport = TRANSFORM(TOKEN) IN TEMPLATE(INFRASTRUCTURES)
{
    "airport" IN type
    TRANSFORMATION (SMARTENTRY) IN name
}

DEFINE Railway = TRANSFORM(TOKEN) IN TEMPLATE(INFRASTRUCTURES)
{
    "railway" IN type
    TRANSFORMATION (SMARTENTRY) IN name
}

Finally, two extraction rules are defined, one for each transformer. The first extraction rule extracts all the proper names (+ TYPE(NPR)) of airports (ANCESTOR(12830)), while the second extracts all the recognized proper names of railways (ANCESTOR(19647)).

SCOPE SENTENCE
{
    IDENTIFY(INFRASTRUCTURES)
    {
        @@Airport[ANCESTOR(12830) + TYPE(NPR)] // 12830: airport, airdrome, aerodrome
    }

    IDENTIFY(INFRASTRUCTURES)
    {
        @@Railway[ANCESTOR(19647) + TYPE(NPR)] // 19647: railway, rail road, railway line, railway network
    }
}

Suppose the above rules are used to extract infrastructure information from the following text:

The Canadian Pacific Railway (CPR), formerly also known as CP Rail (reporting mark CP) between 1968 and 1996, is a historic Canadian Class I railroad founded in 1881 and now operated by Canadian Pacific Railway Limited (a subsidiary of Canadian Pacific Limited), which began operations as legal owner in a corporate restructuring in 2001.
Historically, Canadian Pacific operated several non-railway businesses. In 1971, these businesses were split off into the separate company Canadian Pacific Limited, and in 2001, that company was further split into five companies. CP no longer provides any of these services.
Canadian Pacific Airlines, also called CP Air, operated from 1942 to 1987 and was the main competitor of Canadian government-owned Air Canada. Based at Vancouver International Airport, it served Canadian and international routes until it was purchased by Pacific Western Airlines which merged PWA and CP Air to create Canadian Airlines.

The extraction output will contain two records:

Template: INFRASTRUCTURES

@name  @type
CPR    railway

Template: INFRASTRUCTURES

@name                            @type
Vancouver International Airport  airport

The first record contains information about a railway line, whereas the second contains information about an airport. Both entities have been mapped to the same template, but each one is clearly identified by the @type field. The first record also shows the result of the SMARTENTRY transformation, which finds all the forms in which the concept is expressed in the text (CPR, Canadian Pacific Railway) and normalizes them to a single form (CPR).

You can also combine more than one transformation option and/or add a constant value to the extracted value. Consider for instance a new requirement: extract Canadian Pacific Railway with the SMARTENTRY transformation and also include its syncon ID in the extraction record. The previously defined code would be modified as follows:

DEFINE Railway = TRANSFORM(TOKEN) IN TEMPLATE(INFRASTRUCTURES)
{
    "railway" IN type
    TRANSFORMATION (SMARTENTRY) + " " + TRANSFORMATION(SYNCON) IN name
}

After the SMARTENTRY transformation, a constant consisting of a single space is concatenated using the plus sign (+), so that the first value is separated from the newly added syncon ID (see the SYNCON transformation). With the new transformer definition, the value extracted into the @name field becomes:

Template: INFRASTRUCTURES

@name      @type
CPR 99667  railway

To better identify the two elements that now make up this value, another constant value can be added to act as a qualifier.

DEFINE Railway = TRANSFORM(TOKEN) IN TEMPLATE(INFRASTRUCTURES)
{
    "railway" IN type
    TRANSFORMATION (SMARTENTRY) + " - Syncon ID: " + TRANSFORMATION(SYNCON) IN name
}

The new line states that the output should contain the entry for the extracted value, then the constant " - Syncon ID: " (a space, a dash, another space, the label Syncon ID: and a trailing space), and finally the ID of the extracted syncon. With the new transformer definition, the value extracted into the @name field will now be:

Template: INFRASTRUCTURES

@name                   @type
CPR - Syncon ID: 99667  railway
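The concatenation performed by the plus sign can be pictured with a small Python sketch. This is only an illustration of how the final @name value is assembled, not Rules-language code: the smartentry and syncon functions and their lookup tables are hypothetical stand-ins whose values are copied from the example output above.

```python
# Hypothetical lookup tables standing in for the engine's knowledge.
# The values only mirror the example above (CPR, 99667).
SMARTENTRY_FORMS = {"Canadian Pacific Railway": "CPR", "CPR": "CPR"}
SYNCON_IDS = {"Canadian Pacific Railway": "99667", "CPR": "99667"}


def smartentry(token: str) -> str:
    """Stand-in for TRANSFORMATION(SMARTENTRY): normalize to a single form."""
    return SMARTENTRY_FORMS.get(token, token)


def syncon(token: str) -> str:
    """Stand-in for TRANSFORMATION(SYNCON): return the syncon ID as text."""
    return SYNCON_IDS[token]


def railway_record(token: str) -> dict:
    """Mimic the Railway transformer body for one extracted token."""
    return {
        # "railway" IN type
        "type": "railway",
        # TRANSFORMATION (SMARTENTRY) + " - Syncon ID: " + TRANSFORMATION(SYNCON) IN name
        "name": smartentry(token) + " - Syncon ID: " + syncon(token),
    }


record = railway_record("Canadian Pacific Railway")
print(record["name"])  # CPR - Syncon ID: 99667
```

Each operand of the plus sign contributes one piece of text in order, so any spacing or labeling between transformed values has to be supplied explicitly as a constant.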