Default normalization

Overview

By default, the value of an extracted field or tag instance is either the exact portion of text that matches the field-prefixed or tag-prefixed operand of the rule condition, or a normalized version of that text. The cases in which normalization is performed are described below.

Normalization to the lemma

If the rule's condition has one of these attributes:

the value is normalized to the lemma, or base form, of the matched token: for example, the singular of a noun that also has a plural, or the infinitive of a verb.

For example this rule:

SCOPE SENTENCE
{
    IDENTIFY(PETS)
    {
        @TYPE[LEMMA("dog")]
    }
}

applied to this input text:

I have two dogs.

will set field TYPE to the value dog, which is the lemma (or base form) of dogs.

Normalization to the main lemma of the syncon

If the rule's condition has one of these attributes:

  • SYNCON
  • ANCESTOR
  • LIST
  • BLIST
  • TYPE with entity types corresponding to proper nouns like:

    • ANM
    • BLD
    • DEV
    • DOC
    • ENT
    • FDD
    • GEA
    • GEO
    • GEX
    • LEN
    • MMD
    • NPH
    • ORG
    • PPH
    • PRD
    • VCL
    • WRK

    and the named entity is defined in the text intelligence engine's knowledge graph, for instance the name of a very famous organization.

its value is the main lemma of the knowledge graph syncon corresponding to the token text.

For example, this rule:

SCOPE SENTENCE
{
    IDENTIFY(PERSONAL_DATA)
    {
        @BIRTH_PLACE[SYNCON(100000065)]//@SYN: #100000065# [Los Angeles]
    }
}

applied to this text:

He was born in L.A.

will extract a PERSONAL_DATA record with an instance of field BIRTH_PLACE set to Los Angeles, because the knowledge graph contains a syncon for the American city and the syncon's main lemma is Los Angeles. For an analogous reason, this rule:

SCOPE SENTENCE
{
    IDENTIFY(COMPANIES)
    {
        @NAME[TYPE(COM)]
    }
}

applied to this text:

Microsoft performed relatively well this quarter.

will normalize Microsoft to:

Microsoft Corporation

If an entity is recognized heuristically and not because of the presence of a corresponding syncon in the knowledge graph, the value is the literal text, without any normalization. For example, this rule:

SCOPE SENTENCE
{
    IDENTIFY(COMPANY)
    {
        @NAME[TYPE(COM)]
    }
}

applied to this text:

Acme Group Holding Limited has been founded by John Smith in 1999.

will set field @NAME to Acme Group Holding Limited because there's no syncon for Acme in the knowledge graph but it is nevertheless recognized as the name of a company.

Normalization with formatted values

If the TYPE attribute is used with one of these entity types:

  • ADR
  • DAT
  • HOU
  • MEA
  • MON
  • PCT
  • PHO
  • WEB

the value is interpreted and formatted as described below. The format may vary based on the text language.

TYPE(ADR)

Addresses (TYPE(ADR)) are formatted like this:

  • street number, street name for English.
  • street name, street number for the other languages.

For example, this rule:

SCOPE SENTENCE
{
    IDENTIFY(PERSONAL_DATA)
    {
        @ADDRESS[TYPE(ADR)]
    }
}

applied to this text:

ATTN: Dennis Menees, CEO
Global Co.
90210 Broadway Blvd.
Nashville, TN 37011

will extract:

90210, Broadway Boulevard - Nashville (TENN.) - 37011

Other elements, such as the city, the state or country, and the zip code, are also extracted if they are in the sentence, but their position in the extracted value is not normalized.

For example, the same address written in this way:

ATTN: Dennis Menees, CEO
Global Co.
90210 Broadway Blvd.
37011, Nashville, TN

will be normalized to:

90210, Broadway Boulevard - 37011 - Nashville (TENN.)

TYPE(DAT)

Dates (TYPE(DAT)) are normalized according to these formats:

  • MMM-(D)D-YYYY for English and German.
  • YYYY-(M)M月-(D)D for Chinese and Japanese.
  • YYYY-(M)M월-(D)D for Korean.
  • YYYY-Month-(D)D for Arabic.
  • (D)D-MMM-YYYY for all other languages.

If the day of the week is included, the format changes as follows:

  • ddd, MMM-(D)D-YYYY for English and German.
  • YYYY-Month-(D)D ,ddd for Arabic.
  • ddd, (D)D-MMM-YYYY for:

    • Italian
    • Dutch
    • Spanish
    • Portuguese
    • Russian
    • French

French months Juin and Juillet are normalized to Juin and Juil. Arabic months and days of the week are written in full words.

If the day in the text is written from 01 to 09, the leading zero is removed. Months are written as words, in their abbreviated form.

For example, this rule:

SCOPE SENTENCE
{
    IDENTIFY(PERSONAL_DATA)
    {
        @DATE_OF_BIRTH[TYPE(DAT)]
    }
}

applied to this text:

I was born on Thursday, June 15th 1995.

will extract:

Thu, Jun-15-1995
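
As an illustration only, the English pattern above can be reproduced with a few lines of Python. This is a minimal sketch, not the engine's implementation: the function name format_date_en and the abbreviation tables are assumptions made for the example, and only the output pattern (ddd, MMM-(D)D-YYYY with no leading zero on the day) comes from the description above.

```python
from datetime import date

# English abbreviations are written out explicitly so that the sketch
# does not depend on the system locale.
MONTHS_EN = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
             "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
DAYS_EN = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def format_date_en(d: date, with_weekday: bool = False) -> str:
    """Hypothetical helper: render a date as MMM-(D)D-YYYY.

    When with_weekday is True the value is prefixed with 'ddd, '.
    The day is an integer, so it never carries a leading zero.
    """
    text = f"{MONTHS_EN[d.month - 1]}-{d.day}-{d.year}"
    if with_weekday:
        # date.weekday() returns 0 for Monday, matching DAYS_EN.
        text = f"{DAYS_EN[d.weekday()]}, {text}"
    return text


print(format_date_en(date(1995, 6, 15), with_weekday=True))
print(format_date_en(date(2001, 3, 5)))
```

The first call prints the same value as the example above; the second shows the zero-free day without the day of the week.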

TYPE(HOU)

Time (TYPE(HOU)) is normalized to this format:

(H)H:MM

in 24-hour notation. If the first digit of the hour in the text is zero, it is removed from the normalized value.

For example, this rule:

SCOPE SENTENCE
{
    IDENTIFY(TIME)
    {
        @HOUR[TYPE(HOU)]
    }
}

applied to these texts:

It's 09:30 P.M.
It's half past eight.

will extract:

21:30
8:30
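
The (H)H:MM pattern can be sketched in Python as follows. This is a hedged illustration, not the engine's code: the function format_time and its meridiem parameter are assumptions introduced for the example, and the sketch starts from already-parsed numbers rather than from text such as "half past eight".

```python
from typing import Optional


def format_time(hour: int, minute: int, meridiem: Optional[str] = None) -> str:
    """Hypothetical helper: render a time as (H)H:MM in 24-hour notation.

    meridiem may be "AM", "PM" or None (None means the hour is taken as is).
    """
    if meridiem == "PM" and hour < 12:
        hour += 12          # 9 PM -> 21
    elif meridiem == "AM" and hour == 12:
        hour = 0            # 12 AM -> 0
    # The hour is not zero-padded, the minutes always use two digits.
    return f"{hour}:{minute:02d}"


print(format_time(9, 30, "PM"))
print(format_time(8, 30))
```

The two calls print the values extracted in the example above.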

TYPE(MEA)

Measures (TYPE(MEA)) are normalized to their numerical value followed by the unit of measure.

For example, the distance in:

His house is 20.5 KM from the city.

will be extracted as:

20.5 kilometer

The unit of measure is expressed in the text language, for example:

Language      Unit of measure for kilometers
English       kilometer
Spanish       kilómetro
French        kilomètre
Russian       километр
Portuguese    quilômetro
German        Kilometer
Italian       chilometro
Dutch         kilometer
Arabic        كيلومتر
Japanese      キロメートル
Korean        킬로미터
Chinese       公里

  • The thousands separator is a comma for English and Arabic and a period for the other languages. The decimal separator is a period for English and Arabic and a comma for the other languages.
  • Leading and trailing zeros are removed.
  • The unit of measure is written in its extended version and in the singular.
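
The separator and zero-stripping conventions for measures can be illustrated with a short Python sketch. It is not the engine's implementation: the function names, the two-letter language codes and the choice of Decimal are assumptions made for the example, and only the conventions themselves come from the description above.

```python
from decimal import Decimal

# Assumption: two-letter language codes. Per the description, English and
# Arabic use a comma for thousands and a period for decimals.
PERIOD_DECIMAL_LANGUAGES = {"en", "ar"}


def format_number(value: str, language: str) -> str:
    """Hypothetical helper: apply the separator and zero-stripping rules."""
    number = Decimal(value)          # parsing drops leading zeros
    text = format(number, ",f")      # fixed-point with comma thousands separator
    if "." in text:
        # Remove trailing zeros, and the period itself if nothing follows it.
        text = text.rstrip("0").rstrip(".")
    if language not in PERIOD_DECIMAL_LANGUAGES:
        # Swap the two separators for the other languages.
        text = text.translate(str.maketrans({",": ".", ".": ","}))
    return text


def format_measure(value: str, unit: str, language: str) -> str:
    """Numerical value followed by the (already singular) unit name."""
    return f"{format_number(value, language)} {unit}"


print(format_measure("20.5", "kilometer", "en"))
print(format_number("1234.50", "it"))
```

The first call prints 20.5 kilometer, as in the example above; the second prints 1.234,5, showing the swapped separators and the removed trailing zero.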

TYPE(MON)

Amounts of money (TYPE(MON)) are normalized to their numerical value followed by the name of the currency.

For example, the amount in:

I earned forty dollars.

is extracted as:

40 dollar

The name of the currency is expressed in the text language, for example:

Language(s)    Name of the currency for euros
Russian        евро
Japanese       ユーロ
Chinese        欧元
Korean         유로
Arabic         اورو
German         Euro
Other          euro

As for measures (see above):

  • The thousands separator is a comma for English and Arabic and a period for the other languages. The decimal separator is a period for English and Arabic and a comma for the other languages.
  • Leading and trailing zeros are removed.
  • The currency is written in the singular.

TYPE(PCT)

Percentages (TYPE(PCT)) are normalized to the numerical value followed by the percent sign.

For example, the percentage in:

Forty per cent of voters voted no.

is extracted as:

40%

The decimal separator is the period for English and Arabic and the comma for the other languages. Leading and trailing zeros are removed.

TYPE(PHO)

Phone numbers (TYPE(PHO)) are normalized like this:

[international prefix with plus sign ]number

For example, both:

0044 1632 960938

and:

+441632960938

are normalized to:

+44 1632960938
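
A rough Python sketch of this normalization is shown below. It is only an illustration under stated assumptions: the function normalize_phone is hypothetical, the handling of a leading 00 as an international prefix is inferred from the example above, and the CALLING_CODES table is a deliberately tiny stand-in for a real list of country calling codes.

```python
import re

# Assumption: a tiny, hypothetical table of country calling codes.
# A real table would cover every assigned code.
CALLING_CODES = ("44", "39", "1")


def normalize_phone(raw: str) -> str:
    """Hypothetical helper: '[+prefix ]number' with separators removed."""
    # Keep only digits and the plus sign.
    digits = re.sub(r"[^\d+]", "", raw)
    # Treat a leading 00 as the international prefix, as in the example.
    if digits.startswith("00"):
        digits = "+" + digits[2:]
    if not digits.startswith("+"):
        return digits                # no international prefix in the text
    rest = digits[1:]
    for code in CALLING_CODES:
        if rest.startswith(code):
            return f"+{code} {rest[len(code):]}"
    return digits                    # unknown prefix: leave it unsplit


print(normalize_phone("0044 1632 960938"))
print(normalize_phone("+441632960938"))
```

Both calls print +44 1632960938, matching the example above.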

TYPE(WEB)

Web addresses (TYPE(WEB)) are normalized to the domain name followed by the path.

For example:

https://try.expert.ai

is normalized to:

try.expert.ai

while:

https://docs.expert.ai/studio/2023.1/languages/attributes/ancestor/

is normalized to:

docs.expert.ai/studio/2023.1/languages/attributes/ancestor
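
The two examples above can be reproduced with Python's standard urllib.parse module. This is a minimal sketch, not the engine's implementation: the function normalize_web is hypothetical, and dropping the trailing slash is inferred from the second example rather than stated as a rule.

```python
from urllib.parse import urlsplit


def normalize_web(url: str) -> str:
    """Hypothetical helper: domain name followed by the path.

    The scheme, query string and fragment are discarded, and a trailing
    slash is removed (an assumption based on the example above).
    """
    parts = urlsplit(url)
    return (parts.netloc + parts.path).rstrip("/")


print(normalize_web("https://try.expert.ai"))
print(normalize_web("https://docs.expert.ai/studio/2023.1/languages/attributes/ancestor/"))
```

The two calls print the normalized values shown in the examples above.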