
Default normalization

Overview

By default, the value of an extracted field is either the exact portion of text that matches the rule condition or a normalized version of that text. The cases in which normalization is performed are described below.

Normalization to the lemma

If the field is extracted using one of these attributes:

its value is normalized to the base form of the lemma, such as the singular for a noun that also has a plural form, or the infinitive for a verb.

For example, this rule:

SCOPE SENTENCE
{
    IDENTIFY(PETS)
    {
        @DOG[LEMMA("dog")]
    }
}

applied to this input text:

I have two dogs.

will extract the lemma as:

dog

Normalization to the main lemma of the syncon

If the field is extracted using one of these attributes:

  • SYNCON
  • ANCESTOR
  • LIST
  • BLIST
  • TYPE with types corresponding to proper nouns like:

    • ANM
    • BLD
    • DEV
    • DOC
    • ENT
    • FDD
    • GEA
    • GEO
    • GEX
    • LEN
    • MMD
    • NPH
    • ORG
    • PPH
    • PRD
    • VCL
    • WRK

    if the extracted named entity is defined in the text intelligence engine's Knowledge Graph, for instance the name of a very famous organization.

its value is the main lemma of the Knowledge Graph syncon corresponding to the extracted text.

For example, this rule:

SCOPE SENTENCE
{
    IDENTIFY(PERSONAL_DATA)
    {
        @BIRTH_PLACE[SYNCON(100000065)]//@SYN: #100000065# [Los Angeles]
    }
}

applied to this text:

He was born in L.A.

will extract the syncon as:

Los Angeles

because the engine's Knowledge Graph has a syncon for the American city and the syncon's main lemma is Los Angeles. For an analogous reason, this rule:

SCOPE SENTENCE
{
    IDENTIFY(COMPANIES)
    {
        @NAME[TYPE(COM)]
    }
}

applied to this text:

Microsoft performed relatively well this quarter.

will normalize Microsoft to:

Microsoft Corporation

Warning

Entity name normalization occurs only if there is a corresponding syncon in the text intelligence engine's Knowledge Graph. Many named entities are recognized and extracted without needing a corresponding syncon in the Knowledge Graph: in these cases, the value of the field is the extracted text. For example, this rule:

SCOPE SENTENCE
{
    IDENTIFY(COMPANY)
    {
        @NAME[TYPE(COM)]
    }
}

applied to this text:

Acme Group Holding Limited has been founded by John Smith in 1999.

will extract:

Acme Group Holding Limited

without normalization.
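
The fallback described in this warning can be pictured as a lookup with a default. The Python sketch below is only an illustration: the `MAIN_LEMMAS` dictionary and the `field_value` function are hypothetical stand-ins, not part of the engine or its API, and the real Knowledge Graph lookup is far richer than a dictionary.

```python
# Hypothetical stand-in for the Knowledge Graph: surface form -> main lemma.
MAIN_LEMMAS = {
    "L.A.": "Los Angeles",
    "Microsoft": "Microsoft Corporation",
}

def field_value(extracted_text):
    """Return the main lemma if a syncon is known, else the text as extracted."""
    return MAIN_LEMMAS.get(extracted_text, extracted_text)

print(field_value("Microsoft"))                   # Microsoft Corporation
print(field_value("Acme Group Holding Limited"))  # Acme Group Holding Limited
```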

Normalization with formatted values

If the field is extracted with the TYPE attribute and one of these types:

  • ADR
  • DAT
  • HOU
  • MEA
  • MON
  • PCT
  • PHO
  • WEB

its value is interpreted and formatted as described below. The format may vary based on the text language.

TYPE(ADR)

Addresses (TYPE(ADR)) are formatted like this:

  • street number, street name for English.
  • street name, street number for the other languages.

For example, this rule:

SCOPE SENTENCE
{
    IDENTIFY(PERSONAL_DATA)
    {
        @ADDRESS[TYPE(ADR)]
    }
}

applied to this text:

ATTN: Dennis Menees, CEO
Global Co.
90210 Broadway Blvd.
Nashville, TN 37011

will extract:

90210, Broadway Boulevard - Nashville (TENN.) - 37011

Other elements, like the city, the state/country and the zip code, are also extracted if they are in the sentence, but their position in the extracted value is not normalized.

For example, the same address written in this way:

ATTN: Dennis Menees, CEO
Global Co.
90210 Broadway Blvd.
37011, Nashville, TN

will be normalized to:

90210, Broadway Boulevard - 37011 - Nashville (TENN.)

TYPE(DAT)

Dates (TYPE(DAT)) are normalized according to these formats:

  • MMM-(D)D-YYYY for English and German.
  • (D)D-MMM-YYYY for all other languages.

Note

French months Juin and Juillet are normalized to Juin and Juil.

If the day in the text is written as 01 to 09, the leading zero is removed. Months are written as words, in their abbreviated forms.

For example, this rule:

SCOPE SENTENCE
{
    IDENTIFY(PERSONAL_DATA)
    {
        @DATE_OF_BIRTH[TYPE(DAT)]
    }
}

applied to this text:

He was born on the seventh of June, 1995.

will extract:

Jun-7-1995
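
The two date layouts can be reproduced with a few lines of Python. This is a minimal sketch of the output format only, not of how the engine parses dates: the function name `format_date`, the language codes `"en"` and `"de"`, and the `EN_MONTHS` table are assumptions made for the illustration.

```python
from datetime import date

# Assumption: English abbreviations, hard-coded so the result does not
# depend on the system locale. Other languages need their own table.
EN_MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
             "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

def format_date(d, language="en", months=EN_MONTHS):
    """Render a date as MMM-(D)D-YYYY or (D)D-MMM-YYYY."""
    month = months[d.month - 1]
    day = str(d.day)  # an int carries no leading zero, so 07 becomes 7
    if language in ("en", "de"):
        return f"{month}-{day}-{d.year}"
    return f"{day}-{month}-{d.year}"

print(format_date(date(1995, 6, 7)))  # Jun-7-1995
```

For any other language, pass that language's abbreviated month names through the `months` parameter; the day then comes first.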

TYPE(HOU)

Time (TYPE(HOU)) is normalized to this format:

(H)H:MM

in 24-hour notation. If the hour in the text starts with a zero, that zero is removed in the normalized value.

For example, this rule:

SCOPE SENTENCE
{
    IDENTIFY(TIME)
    {
        @HOUR[TYPE(HOU)]
    }
}

applied to these texts:

It's 09:30 P.M.
It's half past eight.

will extract:

21:30
8:30
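
As a rough illustration of the (H)H:MM layout, the Python sketch below converts an already-parsed hour, minute and optional AM/PM marker. The function `format_time` and its parameters are hypothetical; how the engine interprets phrases like "half past eight" is not covered here.

```python
def format_time(hour, minute, meridiem=None):
    """Render a clock time as (H)H:MM in 24-hour notation."""
    if meridiem == "PM" and hour < 12:
        hour += 12   # 9 PM -> 21
    elif meridiem == "AM" and hour == 12:
        hour = 0     # 12 AM -> 0
    # The hour is printed without padding, the minutes with two digits.
    return f"{hour}:{minute:02d}"

print(format_time(9, 30, "PM"))  # 21:30
print(format_time(8, 30))        # 8:30
```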

TYPE(MEA)

Measures (TYPE(MEA)) are normalized to their numerical value followed by the unit of measure.

For example, the distance in:

His house is 20.5 KM from the city.

will be extracted as:

20.5 kilometer

The unit of measure is expressed in the text language, for example:

Language      Unit of measure for kilometers
English       kilometer
Spanish       kilómetro
French        kilomètre
Russian       километр
Portuguese    quilômetro
German        Kilometer
Italian       chilometro
Dutch         kilometer

  • The thousands separator is a comma for English and a period for the other languages. The decimal separator is a period for English and a comma for the other languages.
  • Leading and trailing zeros are removed.
  • The unit of measure is written in its extended version and in the singular.
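
The separator and zero-removal rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the engine's implementation: `format_number` and `format_measure` are hypothetical names, the language code `"en"` stands for English, trailing zeros are taken to mean zeros at the end of the decimal part, and the unit name is supplied by the caller already in its extended singular form.

```python
from decimal import Decimal

def format_number(value, language="en"):
    """Format a numeric string with the separators described above."""
    # Decimal drops leading zeros; ",f" adds comma thousands separators.
    text = format(Decimal(value), ",f")
    if "." in text:
        # Drop trailing zeros of the decimal part, then a dangling period.
        text = text.rstrip("0").rstrip(".")
    if language != "en":
        # Swap the two separators for languages other than English.
        text = text.translate(str.maketrans(",.", ".,"))
    return text

def format_measure(value, unit, language="en"):
    """Numerical value followed by the unit of measure."""
    return f"{format_number(value, language)} {unit}"

print(format_measure("20.5", "kilometer"))         # 20.5 kilometer
print(format_measure("1234.500", "chilometro", "it"))  # 1.234,5 chilometro
```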

TYPE(MON)

Amounts of money (TYPE(MON)) are normalized to their numerical value followed by the name of the currency.

For example, the amount in:

I earned forty dollars.

is extracted as:

40 dollar

The name of the currency is expressed in the text language, for example:

Language(s)    Name of the currency for euros
Russian        евро
German         Euro
Other          euro

As for measures (see above):

  • The thousands separator is a comma for English and a period for the other languages. The decimal separator is a period for English and a comma for the other languages.
  • Leading and trailing zeros are removed.
  • The currency is written in the singular.

TYPE(PCT)

Percentages (TYPE(PCT)) are normalized to the numerical value followed by the percent sign.

For example, the percentage in:

Forty per cent of voters voted no.

is extracted as:

40%

The decimal separator is the period for English and the comma for the other languages. Leading and trailing zeros are removed.

TYPE(PHO)

Phone numbers (TYPE(PHO)) are normalized like this:

[international prefix with plus sign ]number

For example, both:

0044 1632 960938

and:

+441632960938

are normalized to:

+44 1632960938
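
A minimal Python sketch of this layout is shown below. It rests on several assumptions that the text above does not state: a leading 00 is treated as an international dialing prefix equivalent to the plus sign, the country code is recognized from a small hand-written table (`KNOWN_PREFIXES`), and the function name `format_phone` is hypothetical.

```python
import re

# Assumption: a tiny sample of country calling codes, enough for the example.
KNOWN_PREFIXES = ("1", "33", "34", "39", "44", "49")

def format_phone(raw):
    """Render a phone number as '[+prefix ]number'."""
    compact = re.sub(r"[^\d+]", "", raw)  # keep only digits and '+'
    if compact.startswith("00"):
        compact = "+" + compact[2:]       # 0044... -> +44...
    if not compact.startswith("+"):
        return compact                    # no international prefix
    body = compact[1:]
    for prefix in KNOWN_PREFIXES:
        if body.startswith(prefix):
            return f"+{prefix} {body[len(prefix):]}"
    return compact                        # unknown prefix: leave unsplit

print(format_phone("0044 1632 960938"))  # +44 1632960938
print(format_phone("+441632960938"))     # +44 1632960938
```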

TYPE(WEB)

Web addresses (TYPE(WEB)) are normalized to the domain name followed by the path.

For example:

https://try.expert.ai

is normalized to:

try.expert.ai

while:

https://docs.expert.ai/studio/latest/languages/attributes/ancestor/

is normalized to:

docs.expert.ai/studio/latest/languages/attributes/ancestor
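
The same result can be approximated with Python's standard `urllib.parse` module. This is a sketch of the output format only: the function name `format_web` is hypothetical, the removal of the trailing slash is inferred from the example above, and the handling of query strings or fragments (dropped here) is an assumption.

```python
from urllib.parse import urlsplit

def format_web(url):
    """Return the domain name followed by the path, without the scheme."""
    # Without a scheme, urlsplit needs a leading '//' to detect the domain.
    parts = urlsplit(url if "://" in url else "//" + url)
    return (parts.netloc + parts.path).rstrip("/")

print(format_web("https://try.expert.ai"))
# try.expert.ai
print(format_web("https://docs.expert.ai/studio/latest/languages/attributes/ancestor/"))
# docs.expert.ai/studio/latest/languages/attributes/ancestor
```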