Default normalization
Overview
By default, the value of extracted fields and tag instances can be exactly the portion of text that matches the field-prefixed or tag-prefixed operand of the rule condition or a normalized version of that text. Below you can find the description of the cases in which normalization is performed.
Normalization to the lemma
If the rule's condition has one of these attributes:
the value is normalized to the base form of the lemma of the matched token, like the singular in the case of a noun that also admits the plural or the infinitive for verbs.
For example, this rule:
SCOPE SENTENCE
{
IDENTIFY(PETS)
{
@TYPE[LEMMA("dog")]
}
}
applied to this input text:
I have two dogs.
will set field TYPE to value dog, that is the lemma—or base form—of dogs.
Normalization to the main lemma of the syncon
If the rule's condition has one of these attributes:
SYNCON
ANCESTOR
LIST
BLIST
TYPE with entity types corresponding to proper nouns, like:
ANM
BLD
DEV
DOC
ENT
FDD
GEA
GEO
GEX
LEN
MMD
NPH
ORG
PPH
PRD
VCL
WRK
and the named entity is defined in the text intelligence engine's knowledge graph, for instance the name of a very famous organization, the value is the main lemma of the knowledge graph syncon corresponding to the token text.
For example, this rule:
SCOPE SENTENCE
{
IDENTIFY(PERSONAL_DATA)
{
@BIRTH_PLACE[SYNCON(100000065)]//@SYN: #100000065# [Los Angeles]
}
}
applied to this text:
He was born in L.A.
will extract a PERSONAL_DATA record with an instance of field BIRTH_PLACE set to Los Angeles, because the knowledge graph contains a syncon for the American city and the syncon's main lemma is Los Angeles. For an analogous reason, this rule:
SCOPE SENTENCE
{
IDENTIFY(COMPANIES)
{
@NAME[TYPE(COM)]
}
}
applied to this text:
Microsoft performed relatively well this quarter.
will normalize Microsoft to:
Microsoft Corporation
If an entity is recognized heuristically and not because of the presence of a corresponding syncon in the knowledge graph, the value is the literal text, without any normalization. For example, this rule:
SCOPE SENTENCE
{
IDENTIFY(COMPANY)
{
@NAME[TYPE(COM)]
}
}
applied to this text:
Acme Group Holding Limited has been founded by John Smith in 1999.
will set field @NAME to Acme Group Holding Limited because there's no syncon for Acme in the knowledge graph but it is nevertheless recognized as the name of a company.
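The difference between the two cases above can be pictured as a lookup with a fallback. The Python sketch below is only an illustration of that behavior, not the engine's implementation: `MAIN_LEMMAS` and `normalize_entity` are hypothetical names, and the real knowledge graph resolves a token to a syncon through disambiguation rather than through a plain string lookup.

```python
# Hypothetical stand-in for the knowledge graph: surface form -> main lemma.
MAIN_LEMMAS = {
    "L.A.": "Los Angeles",
    "Microsoft": "Microsoft Corporation",
}


def normalize_entity(matched_text: str) -> str:
    """Return the main lemma if the entity is known, else the literal text."""
    return MAIN_LEMMAS.get(matched_text, matched_text)
```

With this table, `normalize_entity("L.A.")` gives the main lemma, while `normalize_entity("Acme Group Holding Limited")` gives back the text unchanged, because it has no entry.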
Normalization with formatted values
If the TYPE attribute is used with one of the entity types described below, the value is interpreted and formatted as described for that type. The format may vary based on the text language.
TYPE(ADR)
Addresses (TYPE(ADR)) are formatted like this:
- street number, street name for English.
- street name, street number for the other languages.
For example, this rule:
SCOPE SENTENCE
{
IDENTIFY(PERSONAL_DATA)
{
@ADDRESS[TYPE(ADR)]
}
}
applied to this text:
ATTN: Dennis Menees, CEO
Global Co.
90210 Broadway Blvd.
Nashville, TN 37011
will extract:
90210, Broadway Boulevard - Nashville (TENN.) - 37011
Other elements, like the city, the state/country and the zip code, are also extracted if they are in the sentence, but their position in the extracted value is not normalized.
For example, the same address written in this way:
ATTN: Dennis Menees, CEO
Global Co.
90210 Broadway Blvd.
37011, Nashville, TN
will be normalized to:
90210, Broadway Boulevard - 37011 - Nashville (TENN.)
TYPE(DAT)
Dates (TYPE(DAT)) are normalized according to these formats:
- MMM-(D)D-YYYY for English and German.
- YYYY-(M)M月-(D)D for Chinese and Japanese.
- YYYY-(M)M월-(D)D for Korean.
- YYYY-Month-(D)D for Arabic.
- (D)D-MMM-YYYY for all other languages.
If the day of the week is included, the format changes as follows:
- ddd, MMM-(D)D-YYYY for English and German.
- YYYY-Month-(D)D ,ddd for Arabic.
- ddd, (D)D-MMM-YYYY for Italian, Dutch, Spanish, Portuguese, Russian and French.
French months Juin and Juillet are normalized to Juin and Juil. Arabic months and days of the week are written in full words.
If the day in the text is from 01 to 09, the leading zero is removed. Months are written as words, in their abbreviated form.
For example, this rule:
SCOPE SENTENCE
{
IDENTIFY(PERSONAL_DATA)
{
@DATE_OF_BIRTH[TYPE(DAT)]
}
}
applied to this text:
I was born on Thursday, June 15th 1995.
will extract:
Thu, Jun-15-1995
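As a rough illustration of the English pattern, the Python sketch below renders a date in the MMM-(D)D-YYYY format, with the optional day of the week in front. It is not the engine's code: the function name `format_english_date` and the two abbreviation lists are assumptions made for this example, and the abbreviations the engine actually uses may differ.

```python
from datetime import date

# English abbreviations, hard-coded so the result does not depend on the locale.
# These lists are assumptions for the example.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]  # date.weekday() order


def format_english_date(value: date, with_weekday: bool = False) -> str:
    """Render a date as MMM-(D)D-YYYY, optionally prefixed by 'ddd, '."""
    # value.day is an integer, so a leading zero never appears.
    text = f"{MONTHS[value.month - 1]}-{value.day}-{value.year}"
    if with_weekday:
        text = f"{WEEKDAYS[value.weekday()]}, {text}"
    return text
```

Calling `format_english_date(date(1995, 6, 15), with_weekday=True)` reproduces the value extracted in the example above.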
TYPE(HOU)
Time (TYPE(HOU)) is normalized to this format:
(H)H:MM
with a 24-hour notation. If the first digit of the hour in the text is zero, it gets removed in the normalized value.
For example, this rule:
SCOPE SENTENCE
{
IDENTIFY(TIME)
{
@HOUR[TYPE(HOU)]
}
}
applied to these texts:
It's 09:30 P.M.
It's half past eight.
will extract:
21:30
8:30
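The conversion to 24-hour notation can be sketched in Python as follows. This is a simplified illustration under stated assumptions: `format_time` is a hypothetical helper that receives the hour, the minutes and an optional A.M./P.M. marker as already parsed values, so it does not cover the interpretation of expressions written in words, such as half past eight.

```python
from typing import Optional


def format_time(hour: int, minute: int, meridiem: Optional[str] = None) -> str:
    """Render a clock time as (H)H:MM in 24-hour notation."""
    if meridiem is not None:
        marker = meridiem.replace(".", "").strip().upper()  # "P.M." -> "PM"
        if marker == "PM" and hour != 12:
            hour += 12
        elif marker == "AM" and hour == 12:
            hour = 0
    # The hour is printed without padding, the minutes always with two digits.
    return f"{hour}:{minute:02d}"
```

For instance, `format_time(9, 30, "P.M.")` corresponds to the first text above and `format_time(8, 30)` to the second.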
TYPE(MEA)
Measures (TYPE(MEA)) are normalized to their numerical value followed by the unit of measure.
For example, the distance in:
His house is 20.5 KM from the city.
will be extracted as:
20.5 kilometer
The unit of measure is expressed in the text language, for example:
Language | Unit of measure for kilometers |
---|---|
English | kilometer |
Spanish | kilómetro |
French | kilomètre |
Russian | километр |
Portuguese | quilômetro |
German | Kilometer |
Italian | chilometro |
Dutch | kilometer |
Arabic | كيلومتر |
Japanese | キロメートル |
Korean | 킬로미터 |
Chinese | 公里 |
The thousands separator is a comma for English and Arabic and a period for the other languages. The decimal separator is a period for English and Arabic and a comma for the other languages.
Leading and trailing zeros are removed.
The unit of measure is written in its extended version and in the singular.
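The separator and zero-stripping rules above can be sketched in Python like this. It is an approximation, not the engine's implementation: `format_number`, `format_measure` and the two-letter language codes are assumptions made for the example, and the input number is assumed to be already parsed into a plain string that uses a period as decimal separator.

```python
from decimal import Decimal

# Languages that, per the rules above, use "," for thousands and "." for decimals.
# The two-letter codes are an assumption of this sketch.
COMMA_THOUSANDS = {"en", "ar"}


def format_number(raw: str, lang: str) -> str:
    """Format a number with the separators used for the given language."""
    number = Decimal(raw)                      # parsing drops leading zeros
    text = f"{number:,f}"                      # e.g. "1,234.500"
    if "." in text:
        text = text.rstrip("0").rstrip(".")    # drop trailing zeros
    if lang not in COMMA_THOUSANDS:
        # Swap the two separators: "1,234.5" -> "1.234,5"
        text = text.translate(str.maketrans(",.", ".,"))
    return text


def format_measure(raw: str, unit: str, lang: str) -> str:
    """Join the formatted number and the singular, extended unit name."""
    return f"{format_number(raw, lang)} {unit}"
```

For the example sentence, `format_measure("20.5", "kilometer", "en")` yields the extracted value shown above, while the same function with an Italian text and the value 1234.50 would produce `1.234,5 chilometro`.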
TYPE(MON)
Amounts of money (TYPE(MON)) are normalized to their numerical value followed by the name of the currency.
For example, the amount in:
I earned forty dollars.
is extracted as:
40 dollar
The name of the currency is expressed in the text language, for example:
Language(s) | Name of the currency for euros |
---|---|
Russian | евро |
Japanese | ユーロ |
Chinese | 欧元 |
Korean | 유로 |
Arabic | اورو |
German | Euro |
Other | euro |
As for measures (see above):
- The thousands separator is a comma for English and Arabic and a period for the other languages. The decimal separator is a period for English and Arabic and a comma for the other languages.
- Leading and trailing zeros are removed.
- The currency is written in the singular.
TYPE(PCT)
Percentages (TYPE(PCT)) are normalized to the numerical value followed by the percent sign.
For example, the percentage in:
Forty per cent of voters voted no.
is extracted as:
40%
The decimal separator is the period for English and Arabic and the comma for the other languages. Leading and trailing zeros are removed.
TYPE(PHO)
Phone numbers (TYPE(PHO)) are normalized like this:
[international prefix with plus sign ]number
For example, both:
0044 1632 960938
and:
+441632960938
are normalized to:
+44 1632960938
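A possible way to obtain this result is sketched below in Python. It is a guess at the logic, not the engine's algorithm: `format_phone` and `KNOWN_PREFIXES` are hypothetical names, the prefix list is deliberately tiny, and the behavior for numbers without an international prefix is an assumption, since the text above does not describe it.

```python
import re

# Hypothetical, deliberately tiny list of country calling codes.
KNOWN_PREFIXES = ("44", "39", "1")


def format_phone(raw: str) -> str:
    """Render a phone number as '+prefix number' when a prefix is present."""
    digits = re.sub(r"\D", "", raw)            # keep digits only
    international = raw.strip().startswith("+")
    if digits.startswith("00"):                # "0044..." is the same as "+44..."
        digits = digits[2:]
        international = True
    if not international:
        return digits                          # assumption: no prefix to show
    # Try longer codes first so a longer code is not shadowed by a shorter one.
    for prefix in sorted(KNOWN_PREFIXES, key=len, reverse=True):
        if digits.startswith(prefix):
            return f"+{prefix} {digits[len(prefix):]}"
    return f"+{digits}"                        # unknown prefix: leave unsplit
```

Both spellings of the number in the example above go through the same path once the leading `00` is treated as a plus sign.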
TYPE(WEB)
Web addresses (TYPE(WEB)) are normalized to the domain name followed by the path.
For example:
https://try.expert.ai
is normalized to:
try.expert.ai
while:
https://docs.expert.ai/studio/2023.1/languages/attributes/ancestor/
is normalized to:
docs.expert.ai/studio/2023.1/languages/attributes/ancestor
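The two examples can be reproduced with Python's standard `urllib.parse` module, as in the sketch below. The helper name `format_web_address` is hypothetical, and the removal of the trailing slash is inferred from the second example. How the engine treats ports, query strings or fragments is not stated here, so the sketch simply ignores query strings and fragments.

```python
from urllib.parse import urlsplit


def format_web_address(url: str) -> str:
    """Return the domain name followed by the path, without a trailing slash."""
    parts = urlsplit(url)
    return parts.netloc + parts.path.rstrip("/")
```

The scheme (`https://`) is dropped because only `netloc` and `path` are used to build the result.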