Rules language
Overview
Kill lists and rule concept entities are expressions that must be interpreted as conditions that the text of a document must satisfy.
The rules that serve to confirm the extraction of concepts in thesaurus models are composed by combining taxonomy concepts, rule concepts or rule concept entities with specific operators.
The building blocks of this language for writing expressions and rules are described below.
Kill lists and rule concept entities
Operands
Kill lists and rule concept entities have the same structure.
Each operand is compared with the tokens into which the text is divided. Depending on its type, an operand is true if it matches either the literal text of the token or the token features discovered by NLU analysis.
The type of an operand can be:
- Pattern: true if the token's text matches a Perl compatible regular expression
- Keyword: true if the token's text is equal to a string
- Word: true if the token's text is any of the inflected forms of a lemma
- Type: true if the token has a given grammatical type or is a named entity of a given type
For all operands except Type there is a sub-condition on the word class or entity type of the token. For Pattern and Keyword, a sub-condition on the case is also available, that is whether the match between the value of the operand and the text of the token must be case sensitive or not.
For example, a Word operand for the lemma firm with a noun sub-condition, as set in the visual builder, corresponds to this portion of the expression:
WORD("firm") + TYPE(NOU)
which is true for every token that is an inflection of the firm lemma and is a noun, as in:
Click here to research the best law firm
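As a rough illustration of how an operand and its sub-condition could be evaluated against a token, here is a minimal Python sketch. The Token class, its fields and the helper functions are hypothetical names introduced only for this example; the product evaluates expressions internally and may work differently.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical token, as an NLU analysis might describe it."""
    text: str        # literal text of the token
    lemma: str       # base form of the token
    word_class: str  # word class label, e.g. "NOU" or "ADJ"


def keyword(value, case_sensitive=True):
    """Keyword operand: true if the token's text is equal to a string."""
    if case_sensitive:
        return lambda tok: tok.text == value
    return lambda tok: tok.text.lower() == value.lower()


def word(lemma):
    """Word operand: true if the token is an inflected form of the lemma."""
    return lambda tok: tok.lemma == lemma


def type_(label):
    """Type operand or sub-condition: true if the token has the given word class."""
    return lambda tok: tok.word_class == label


def with_sub_condition(operand, sub_condition):
    """An operand with a sub-condition is true only if both hold for the same token."""
    return lambda tok: operand(tok) and sub_condition(tok)


# Rough equivalent of: WORD("firm") + TYPE(NOU)
firm_noun = with_sub_condition(word("firm"), type_("NOU"))

tokens = [
    Token("law", "law", "NOU"),
    Token("firm", "firm", "NOU"),
    Token("firms", "firm", "NOU"),
    Token("firm", "firm", "ADJ"),  # "a firm grip": same lemma, not a noun
]
matches = [firm_noun(tok) for tok in tokens]
```

In this sketch only the second and third tokens satisfy the operand, because they are both inflections of firm and nouns.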
Operators
Multiple operands are combined two by two with Boolean or positional operators.
The Boolean operators are:
- AND: the combination is true only if both the operands are true.
- AND NOT: the combination is true if the first operand is true and the second operand is false.
- OR: the combination is true if at least one of the two operands is true.
AND and AND NOT have precedence over OR, so in:
A OR B AND C
B AND C is evaluated first, then its result is ORed with A.
Therefore, if:
- A is true
- B is false
- C is false
the expression is true, since B AND C is false and A OR false is true.
If OR were evaluated first, the expression would be false, since A OR B is true and true AND C is false.
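The same precedence can be checked with plain Python Booleans, since Python's and also binds more tightly than or. This is only an analogy for the evaluation order described above, not the expression language itself.

```python
A, B, C = True, False, False

# AND evaluated first, as the rules language does
and_first = A or (B and C)

# OR evaluated first, shown only for contrast
or_first = (A or B) and C

# Without parentheses Python applies the same precedence as the rules language
unparenthesized = A or B and C

# AND NOT: true if the first operand is true and the second is false
a_and_not_b = A and not B
```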
Positional operators are:
- Strict sequence (>>): the tokens matched by the operands on the sides of the operator must be strictly consecutive, no other token is allowed between them.
- Loose sequence (>): the tokens matched by the operands on the sides of the operator must be positioned one after the other, but tokens with low semantic value (adjectives, adverbs, conjunctions, articles, punctuation) are allowed between them.
- Flexible sequence (DISTANCE in the visual builder, <Min,Max> in the expression): the tokens matched by the operands on its sides must be positioned one after the other, in the same sentence, at a distance, in tokens, that falls between the value of the Min parameter and the value of the Max parameter. Punctuation is counted as a token when computing the distance.
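The three positional operators can be sketched in Python as checks on the positions of two matched tokens. Everything here is an assumption made for illustration: the Token class, the CON, ART and PNT labels for conjunctions, articles and punctuation, and in particular the choice to measure distance as the number of tokens lying between the two matches, which may not be how the product counts it.

```python
from dataclasses import dataclass

# Word classes treated as "low semantic value" in this sketch.
# ADJ and ADV come from the word class table; CON, ART and PNT are hypothetical labels.
LOW_VALUE = {"ADJ", "ADV", "CON", "ART", "PNT"}


@dataclass
class Token:
    text: str
    word_class: str
    sentence: int  # index of the sentence the token belongs to


def strict_sequence(i, j):
    """>> : the second match immediately follows the first."""
    return j == i + 1


def loose_sequence(tokens, i, j):
    """> : only low semantic value tokens may sit between the two matches."""
    return i < j and all(t.word_class in LOW_VALUE for t in tokens[i + 1:j])


def flexible_sequence(tokens, i, j, min_dist, max_dist):
    """<Min,Max> : same sentence, distance within [min_dist, max_dist]."""
    if i >= j or tokens[i].sentence != tokens[j].sentence:
        return False
    gap = j - i - 1  # assumed definition: tokens in between, punctuation included
    return min_dist <= gap <= max_dist


tokens = [
    Token("firm", "NOU", 0),
    Token(",", "PNT", 0),
    Token("very", "ADV", 0),
    Token("reliable", "ADJ", 0),
    Token("partner", "NOU", 0),
    Token("wins", "VER", 0),
]

strict_ok = strict_sequence(0, 1)                  # "firm" then ","
loose_ok = loose_sequence(tokens, 0, 4)            # only PNT, ADV, ADJ in between
loose_blocked = loose_sequence(tokens, 0, 5)       # the noun "partner" is in between
flexible_ok = flexible_sequence(tokens, 0, 4, 1, 3)
flexible_too_far = flexible_sequence(tokens, 0, 4, 0, 2)
```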
Rules
Scope
The rule scope defines the span of text in which all the rule's operands must be matched.
The scope can be:
- A clause
- One or more consecutive sentences
- A paragraph
When the scope is made of sentences, an option called multiplier specifies the number of consecutive sentences.
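One way to picture a sentence scope with a multiplier is as a window of consecutive sentences. The sketch below assumes overlapping windows that slide one sentence at a time; the source does not say how the product actually forms the spans, and sentence_scopes is a hypothetical helper.

```python
def sentence_scopes(sentences, multiplier):
    """Return every run of `multiplier` consecutive sentences (assumed sliding windows)."""
    if multiplier < 1:
        raise ValueError("multiplier must be at least 1")
    if not sentences:
        return []
    # If the text is shorter than the multiplier, the whole text is the only scope.
    last_start = max(len(sentences) - multiplier, 0)
    return [sentences[i:i + multiplier] for i in range(last_start + 1)]


sentences = ["s1", "s2", "s3"]
two_by_two = sentence_scopes(sentences, 2)
whole_text = sentence_scopes(sentences, 5)
```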
Operands
The operands of a rule can be taxonomy concepts, rule concepts, or rule concept entities.
Operators
Multiple operands in a rule are combined two by two with Boolean or positional operators.
The Boolean operators are:
- AND: the combination is true only if both the operands are true.
- AND NOT: the combination is true if the first operand is true and the second operand is false.
Positional operators are:
- Flexible sequence (DISTANCE in the visual builder, <Min,Max> in the expression): the portion of text matched by the first operand must be followed, within the rule's scope, by the portion of text matched by the second operand at a distance, in text tokens, between the value of the Min parameter and the value of the Max parameter. Punctuation is counted as a token when computing the distance.
- NEXT: the portion of text matched by the first operand must be followed, in any subsequent sentence within the rule's scope, by the portion of text matched by the second operand.
- NEXT NOT: the portion of text matched by the first operand must not be followed by the portion of text matched by the second operand in any of the subsequent sentences within the rule's scope.
- PREV: the portion of text matched by the first operand must be preceded, in any previous sentence within the rule's scope, by the portion of text matched by the second operand.
- PREV NOT: the portion of text matched by the first operand must not be preceded by the portion of text matched by the second operand in any of the previous sentences within the rule's scope.
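The sentence-level operators NEXT, NEXT NOT, PREV and PREV NOT can be sketched in Python by representing each operand as the set of sentence indices, within one scope, where it matched. The function names and this set-based representation are assumptions made for illustration, not the product's implementation.

```python
def next_(first, second):
    """NEXT: some match of `second` lies in a sentence after a match of `first`."""
    return any(t > s for s in first for t in second)


def next_not(first, second):
    """NEXT NOT: some match of `first` has no match of `second` in later sentences."""
    return any(all(t <= s for t in second) for s in first)


def prev(first, second):
    """PREV: some match of `second` lies in a sentence before a match of `first`."""
    return any(t < s for s in first for t in second)


def prev_not(first, second):
    """PREV NOT: some match of `first` has no match of `second` in earlier sentences."""
    return any(all(t >= s for t in second) for s in first)


# First operand matched in sentence 1; second operand matched in sentences 0 and 2.
first, second = {1}, {0, 2}
results = {
    "NEXT": next_(first, second),
    "NEXT NOT": next_not(first, second),
    "PREV": prev(first, second),
    "PREV NOT": prev_not(first, second),
}
```

With these sample matches NEXT and PREV are true, because the second operand appears both after and before sentence 1, while the two NOT variants are false.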
Word classes
A word class can be specified as a sub-condition for the Pattern, Keyword and Word operands of kill lists and rule concept entities, while it can be chosen as the value of a Type operand.
Label | Description |
---|---|
ADJ | Adjective |
ADV | Adverb |
NOU | Noun |
NPR | Proper noun |
VER | Verb |
Named entity types
A named entity type can be specified as a sub-condition for the Pattern, Keyword and Word operands of kill lists and rule concept entities, while it can be chosen as the value of a Type operand.
Label | Description | Example |
---|---|---|
ADR | Street address | Who lived at 221B Baker Street? |
ANM | Animal | Felix is an anthropomorphic black cat. |
BLD | Building | While in London I attended a concert at the Royal Albert Hall. |
COM | Company, business | Tesla Inc. sold 10% of its Bitcoin holdings. |
DAT | Date | Napoleon died on May 5, 1821. |
DEV | Device | My new Galaxy smartphone has seven cameras. |
DOC | Document | I appeal to the Geneva Convention! |
ENT | Generic entity | I have five minutes left. |
EVN | Event | Felice Gimondi won the Tour de France in 1965. |
FDD | Food, beverage | Frank likes to drink Guinness beer. |
GEA | Physical geographic feature | I crossed the Mississippi river with my boat. |
GEO | Administrative geographic area | Alaska is the least densely populated state in the United States. |
GEX | Extended geography | The astronauts have landed on Mars. |
HOU | Hours | The eclipse reached its peak at 3pm. |
LEN | Legal entity | Of course I pay the FICA tax. |
MAI | Email address | For any questions do not hesitate to write to [email protected]. |
MEA | Measure | The chest is five feet wide and 40 inches tall. |
MMD | Mass media | I read it in the Guardian. |
MON | Money | I sold half of my stocks and made six hundred thousand dollars. |
NPH | Person | Hakeem Olajuwon dunked effortlessly. |
ORG | Organization, institution, society | Now they threaten to quit the United Nations if they are not heard. |
PCT | Percentage | The richest 10% of adults in the world own 85% of global wealth. |
PHO | Phone number | For poor database design, call (214) 748-3647. |
PPH | Physical phenomena | The COVID-19 infection is slowing down. |
PRD | Product | The Rolex Daytona is a wonderful watch. |
VCL | Vehicle | A Ferrari 250 GTO was the most expensive car ever sold. |
WEB | Web address | Find the best technical documentation at docs.expert.ai. |
WRK | Work of human intelligence | Grease is a funny musical romantic comedy. |