Domain score
Overview
At the beginning of the categorization process, every domain defined in the project's taxonomy has a score of zero.
During the process, every time a rule is activated, the score of its domain is increased [1] by a certain amount of points.
At the end of the categorization process, some domains may then have a positive score and be returned as output categories.
It is possible that an input text does not trigger any rule (and therefore, no categories are returned), either because the text is not related to any of the domains, or because the rules are incomplete or not well-designed.
Post-processing scripts can be used to filter the output categories, for example to keep only those with the highest scores.
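The accumulation described above can be pictured with a small Python model. This is only an illustrative sketch, not the engine's implementation: the function names `categorize` and `keep_top` and the `(domain, points)` hit representation are assumptions made for the example.

```python
def categorize(taxonomy, rule_hits):
    """Toy model of score accumulation.

    taxonomy  -- iterable of domain names (hypothetical representation)
    rule_hits -- list of (domain, points) pairs, one per rule activation
    """
    scores = {domain: 0 for domain in taxonomy}  # every domain starts at zero
    for domain, points in rule_hits:
        scores[domain] += points  # each activation adds points to its domain
    # Only domains that end up with a positive score are returned.
    return {d: s for d, s in scores.items() if s > 0}


def keep_top(categories):
    """Hypothetical post-processing filter: keep only the top-scoring domains."""
    if not categories:
        return {}
    best = max(categories.values())
    return {d: s for d, s in categories.items() if s == best}
```

A text that triggers no rule yields an empty result, which matches the case where no categories are returned.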
Score options
The amount of points a domain receives depends on several factors, such as:
- The type and the structure of the condition.
- The portion of document in which the condition is met.
- The amount of text that's matched.
- The rule's score option.
The score option syntax is:
DOMAIN(domainName:scoreOption)
The score option is the fundamental variable in the calculation of points. In the simplest cases, the option alone determines the exact amount of points given to the domain when the rule is triggered, while in more complex cases it is combined with other variables.
Read about the scoring mechanism to learn how the other variables are used to compute points.
Standard options
The standard score options are listed in the following table.
Option | Description | Points |
---|---|---|
NORMAL | The default/implicit score option | 10 |
LOW | Lower than the default | 3 |
HIGH | Higher than the default | 15 |
SELECT | Forces a domain in the categorization output | A large positive number |
DISCARD | Forces a domain out of the categorization output | A large negative number |
The first three options assign three slightly different amounts of points. They are useful to give more or less relevance to some rules. When deciding between these options, the rule of thumb is:
- Do not specify any score option (or specify NORMAL, which is equivalent, if you prefer to have explicit options in your source code) in most cases.
- Use the HIGH option to attribute more weight to a few, very selective rules, where "selective" means a rule whose condition is made to match text that is very specific to the domain. When triggered, these rules will boost a domain's score.
- Use the LOW option for "weak" rules, where "weak" means a rule whose condition is not specific to a certain domain. When triggered, a single weak rule has a negligible impact on the overall domain score. However, if several of these rules trigger, their combined impact becomes significant, because many tiny clues are like one big clue.
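To make the "many tiny clues" point concrete, here is a minimal Python sketch that uses the point values from the table. It assumes the simplest case, in which the score option alone determines the points of each hit; the `domain_score` helper is hypothetical and ignores the other variables of the scoring mechanism.

```python
# Point values taken from the standard options table.
STANDARD_POINTS = {"NORMAL": 10, "LOW": 3, "HIGH": 15}


def domain_score(hit_options):
    """Sum the points of a list of rule hits.

    Each item is the score option of the rule that fired;
    None stands for a rule with no explicit option (implicit NORMAL).
    """
    return sum(STANDARD_POINTS[option or "NORMAL"] for option in hit_options)


one_normal_hit = domain_score([None])        # a single default rule
four_weak_hits = domain_score(["LOW"] * 4)   # four "weak" rules firing
```

Under these assumptions, four LOW hits (12 points) already outweigh a single NORMAL hit (10 points).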
The last two options influence the domain score so much that the domain either becomes a guaranteed winner or it's removed from the categorization results.
The SELECT score option has an even greater impact than HIGH: every time a rule with this option generates a hit, the domain to which it belongs is automatically placed among the highest-ranking output categories. Conversely, every time a rule with the DISCARD score option generates a hit, the domain to which the rule belongs is automatically discarded from the output, even if the domain has a positive score due to other rules.
Using the DISCARD score option is like defining a "negative rule" for a domain, because the domain is invalidated when the rule's condition is met.
Note
Domains for which a DISCARD score was triggered will appear with a score of 0 within the ALL predefined set, used by the onCategorizer manipulation function.
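The effect of DISCARD can be sketched in Python as follows. This is a simplified model under stated assumptions: the engine is described as using a large negative number, whereas the sketch models the outcome with an explicit set of discarded domains. The names `output_categories` and `all_set_view` are hypothetical, and SELECT is left out because its exact point value is not given.

```python
def output_categories(scores, discarded):
    """Domains returned as categories: positive score and never discarded."""
    return {d: s for d, s in scores.items() if s > 0 and d not in discarded}


def all_set_view(scores, discarded):
    """Model of the ALL set: discarded domains are reported with a score of 0."""
    return {d: (0 if d in discarded else s) for d, s in scores.items()}


scores = {"dom1": 25, "dom2": 10}   # points collected from other rules
discarded = {"dom1"}                # a DISCARD rule fired for dom1
```

Here `dom1` is dropped from the output even though other rules gave it 25 points.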
Custom options
It is possible to create custom score options. They can be defined in the config.cr file.
The syntax is:
SCORES
{
@scoreOptionName:points,
...
}
For example:
SCORES
{
@LOWER:1,
@HIGHER:20
}
Note
Point values can be positive or negative integers.
Once defined, the names of the new options can be used in the categorization rules thus providing greater variability of the rules' scores.
Modifiers
The categorization score of domains can also be altered by two modifiers:
- BOOSTER
- FADER
BOOSTER
The BOOSTER modifier is used to boost the total score of domains after a number of distinct rules is triggered on them.
The first step is the booster declaration. It has this syntax:
BOOSTER()
{
POINT(numberOfRules1, scoreMultiplier1),
POINT(numberOfRules2, scoreMultiplier2),
...
POINT(numberOfRules#, scoreMultiplier#)
}
where:
- BOOSTER and POINT are language keywords and must be written in uppercase.
- numberOfRules# is the number of rules that must trigger on the domains to boost their score.
- scoreMultiplier# is the score multiplier, which should be higher than 1. It must be written as a decimal value with a dot as separator. For a whole number, a decimal part of 0 must be specified (for example, 2.0).
At least two POINT entries are necessary for the declaration, otherwise an error will occur.
You can only make one declaration per project, otherwise an error will occur.
As an example, consider this booster declaration and these categorization rules:
BOOSTER()
{
POINT(2, 2.0),
POINT(4, 4.0)
}
SCOPE SENTENCE
{
DOMAIN(dom1)
{
LEMMA("dog")
}
DOMAIN(dom1)
{
TYPE(NOU)
}
}
applied to this input text:
My dog is beautiful, her dog is not.
The dom1 domain gets a score of 80 obtained in this way:
- Two occurrences of the lemma dog, each matched by two rules (four hits of 10 points each) = 40 points.
- A score multiplier of 2, defined in the booster declaration for two rules, applied to the total score = 80 points.
If the number of rules is a value in between those of the declaration, the score multiplier value will also be a value in between those of the declaration.
Considering the example above, if three rules trigger on the lemma dog, you will get a score of 180 obtained like this:
- Two occurrences of the lemma dog, each matched by three rules = 60 points.
- A score multiplier of 3, which is not defined in the booster declaration but falls between the declared values, applied to the total score = 180 points.
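The two results above are consistent with a linear interpolation between the declared points. The following Python sketch reproduces them under that assumption. It is not the engine's code: the helper names are hypothetical, and the behavior below the first declared point (no boost) and above the last one (clamping) is a guess made only to keep the function total.

```python
def interpolate(points, x):
    """Multiplier for x from (threshold, multiplier) pairs, piecewise linear."""
    points = sorted(points)
    if x < points[0][0]:
        return 1.0             # assumption: no effect below the first point
    if x >= points[-1][0]:
        return points[-1][1]   # assumption: clamp at the last declared point
    for (x0, m0), (x1, m1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return m0 + (m1 - m0) * (x - x0) / (x1 - x0)


def boosted_score(raw_score, distinct_rules, points):
    """Apply the booster multiplier to the total score of a domain."""
    return raw_score * interpolate(points, distinct_rules)


BOOSTER_POINTS = [(2, 2.0), (4, 4.0)]             # from the declaration above
two_rules = boosted_score(40, 2, BOOSTER_POINTS)   # 40 points, 2 rules
three_rules = boosted_score(60, 3, BOOSTER_POINTS) # 60 points, 3 rules
```

With these inputs, `two_rules` evaluates to 80 and `three_rules` to 180, matching the worked example.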
FADER
The FADER modifier is used to decrease the total score of domains by associating a mitigation value with each activation of the same rule.
The first step is the fader declaration. It has this syntax:
FADER([INTEGRAL])
{
POINT(ruleTriggeringNumber1, scoreMultiplier1),
POINT(ruleTriggeringNumber2, scoreMultiplier2),
...
POINT(ruleTriggeringNumber#, scoreMultiplier#)
}
where:
- FADER and POINT are language keywords and must be written in uppercase.
- ruleTriggeringNumber# is the number of times the same rule is triggered for a domain.
- scoreMultiplier# is a decimal value, with a dot as separator, corresponding to the score multiplier.
- INTEGRAL is an optional language keyword and must be written in uppercase. When declared, it applies scoreMultiplier# on each rule hit instead of the final score (see example below).
At least two POINT entries are necessary for the declaration, otherwise an error will occur.
You can only make one declaration per project, otherwise an error will occur.
As an example, consider this fader declaration and this categorization rule:
FADER()
{
POINT(2, 0.5),
POINT(4, 0.3),
POINT(6, 0.1)
}
SCOPE SENTENCE
{
DOMAIN(dom1)
{
LEMMA("dog")
}
}
applied to this input text:
My dog is beautiful, her dog is not.
The dom1 domain gets a score of 10 obtained in this way:
- Two occurrences of dog = 20 points.
- Score multiplier of 0.5 applied to the total score = 10 points.
If the rule triggering number is a value in between those of the declaration, the score multiplier value will also be a value in between those of the declaration.
Considering the example above, if the lemma dog is triggered three times, you will get a score of 12 obtained like this:
- Three occurrences of dog = 30 points.
- Score multiplier of 0.4 applied to the total score = 12 points.
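The fader arithmetic above can be checked with a short Python sketch. It assumes linear interpolation between the declared points, which is consistent with the multiplier of 0.4 for three triggerings. The helper names are hypothetical, and the multiplier of 1.0 for a single triggering, as well as clamping beyond the last point, are assumptions.

```python
def interpolate(points, x):
    """Multiplier for x from (threshold, multiplier) pairs, piecewise linear."""
    points = sorted(points)
    if x < points[0][0]:
        return 1.0             # assumption: no fading below the first point
    if x >= points[-1][0]:
        return points[-1][1]   # assumption: clamp at the last declared point
    for (x0, m0), (x1, m1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return m0 + (m1 - m0) * (x - x0) / (x1 - x0)


def faded_score(hits, points_per_hit, points):
    """Non-INTEGRAL fader: one multiplier applied to the total score."""
    total = hits * points_per_hit
    return total * interpolate(points, hits)


FADER_POINTS = [(2, 0.5), (4, 0.3), (6, 0.1)]   # from the declaration above
two_hits = faded_score(2, 10, FADER_POINTS)     # 20 points faded by 0.5
three_hits = faded_score(3, 10, FADER_POINTS)   # 30 points faded by 0.4
```

Up to floating-point rounding, `two_hits` is 10 and `three_hits` is 12, as in the example.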
If the fader declaration is:
FADER(INTEGRAL)
{
POINT(2, 0.5),
POINT(4, 0.3),
POINT(6, 0.1)
}
and the same categorization rule is applied to this input text:
My dog is beautiful, her dog is not, his dog is black, your dog is brown.
The dom1 domain gets a score of 22 obtained like this:
10 + (10*0.5) + (10*0.4) + (10*0.3) = 22
With INTEGRAL, the score multiplier is applied to each hit triggered by the rule instead of being applied to the total score.
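The INTEGRAL computation can be sketched in Python as a sum over individual hits. The sketch assumes linear interpolation between declared points and a multiplier of 1.0 for the first hit, both of which are consistent with the formula 10 + (10*0.5) + (10*0.4) + (10*0.3) but are not stated as the engine's actual algorithm. The helper names are hypothetical.

```python
def interpolate(points, x):
    """Multiplier for x from (threshold, multiplier) pairs, piecewise linear."""
    points = sorted(points)
    if x < points[0][0]:
        return 1.0             # assumption: first hit(s) are not faded
    if x >= points[-1][0]:
        return points[-1][1]   # assumption: clamp at the last declared point
    for (x0, m0), (x1, m1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return m0 + (m1 - m0) * (x - x0) / (x1 - x0)


def integral_faded_score(hits, points_per_hit, points):
    """INTEGRAL fader: the n-th hit is scaled by the multiplier for n."""
    return sum(points_per_hit * interpolate(points, n)
               for n in range(1, hits + 1))


FADER_POINTS = [(2, 0.5), (4, 0.3), (6, 0.1)]          # from the declaration above
four_hits = integral_faded_score(4, 10, FADER_POINTS)  # four occurrences of dog
```

Up to floating-point rounding, `four_hits` is 22, whereas the non-INTEGRAL form would give 40 * 0.3 = 12 for the same text.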
[1] With the exception of the DISCARD score option, which corresponds to a large negative amount.