Domain score
Overview
At the beginning of the categorization process, every domain defined in the project's taxonomy has a score of zero.
During the process, every time a rule is activated, the score of its domain is increased [1] by a certain number of points.
At the end of the categorization process, some domains may then have a positive score and be returned as output categories.
It is possible that an input text does not trigger any rule (and therefore, no categories are returned), either because the text is not related to any of the domains, or because the rules are incomplete or not well-designed.
Post-processing scripts can be used to filter the output categories, for example to keep only those with the highest scores.
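The flow described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the product's engine or its post-processing scripting interface: the function names `categorize` and `keep_top`, the sample taxonomy, and the point values in `hits` are all assumptions made for the example.

```python
def categorize(taxonomy, rule_hits):
    """Accumulate points per domain and return the domains with a positive score."""
    scores = {domain: 0 for domain in taxonomy}  # every domain starts at zero
    for domain, points in rule_hits:             # one entry per activated rule
        scores[domain] += points
    return {d: s for d, s in scores.items() if s > 0}


def keep_top(categories, n=1):
    """Hypothetical post-processing filter: keep the n highest-scoring categories."""
    ranked = sorted(categories.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:n])


taxonomy = ["sports", "finance", "travel"]
hits = [("sports", 10), ("finance", 3), ("sports", 15)]  # illustrative values

categories = categorize(taxonomy, hits)
print(categories)            # {'sports': 25, 'finance': 3}
print(keep_top(categories))  # {'sports': 25}
```

Note that `travel` never receives points, so it is not returned; an input that triggers no rule at all would yield an empty result.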
Score options
The number of points a domain receives depends on several factors, such as:
- The type and the structure of the condition.
- The portion of the document in which the condition is met.
- The amount of text that's matched.
- The rule's score option.
The score option syntax is:
DOMAIN(domainName:scoreOption)
The score option is the fundamental variable in the calculation of points. In the simplest cases, the option alone determines the exact number of points given to the domain when the rule is triggered, while in more complex cases it is combined with other variables.
Read about the scoring mechanism to learn how the other variables are used to compute points.
Standard options
The standard score options are listed in the following table.
Option | Description | Points |
---|---|---|
NORMAL | The default/implicit score option | 10 |
LOW | Lower than the default | 3 |
HIGH | Higher than the default | 15 |
SELECT | Forces a domain in the categorization output | A large positive number |
DISCARD | Forces a domain out of the categorization output | A large negative number |
The first three options assign three slightly different amounts of points. They are useful to give more or less relevance to some rules. When deciding between these options, the rule of thumb is:
- Do not specify any score option in most cases (or specify NORMAL, which is equivalent, if you prefer to have explicit options in your source code).
- Use the HIGH option to give more weight to a few very selective rules, where "selective" means a rule whose condition is designed to match text that is highly specific to the domain. When triggered, these rules boost a domain's score.
- Use the LOW option for "weak" rules, where "weak" means a rule whose condition is not specific to a particular domain. When triggered, a single weak rule has a negligible impact on the overall domain score. However, if several of these rules trigger, their combined impact becomes significant, because many small clues add up to one big clue.
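To make the rule of thumb concrete, here is a minimal sketch of the arithmetic, assuming the simplest case in which the score option alone determines the points (the real computation may combine it with other variables). The point values come from the table above; the names `STANDARD_POINTS` and `domain_score` are illustrative only.

```python
# Points of the first three standard options, as listed in the table.
STANDARD_POINTS = {"NORMAL": 10, "LOW": 3, "HIGH": 15}


def domain_score(options):
    """Sum the points of the score option of every triggered rule (simplest case)."""
    return sum(STANDARD_POINTS[option] for option in options)


print(domain_score(["LOW"]))      # 3
print(domain_score(["LOW"] * 4))  # 12
print(domain_score(["NORMAL"]))   # 10
print(domain_score(["HIGH"]))     # 15
```

Under this assumption, one LOW hit contributes little, but four LOW hits together outweigh a single NORMAL hit.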
The last two options influence the domain score so much that the domain either becomes a guaranteed winner or it's removed from the categorization results.
Every time a rule with the SELECT score option generates a hit, the domain to which it belongs is automatically placed among the highest-ranking output categories. Conversely, every time a rule with the DISCARD score option generates a hit, the domain to which the rule belongs is automatically discarded from the output, even if the domain has a positive score due to other rules.
Using the DISCARD score option is like defining a "negative rule" for a domain: the domain is invalidated when the rule's condition is met.
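The effect of SELECT and DISCARD can be modeled with the same additive scheme. In this sketch the magnitudes of 1,000,000 and -1,000,000 are placeholders, since the table only says "a large positive number" and "a large negative number"; the function name `output_categories` and the sample hits are also assumptions.

```python
# SELECT and DISCARD magnitudes are hypothetical placeholders, not documented values.
POINTS = {
    "NORMAL": 10,
    "LOW": 3,
    "HIGH": 15,
    "SELECT": 1_000_000,
    "DISCARD": -1_000_000,
}


def output_categories(hits):
    """Return the domains with a positive total score, highest score first."""
    scores = {}
    for domain, option in hits:
        scores[domain] = scores.get(domain, 0) + POINTS[option]
    positive = {d: s for d, s in scores.items() if s > 0}
    return sorted(positive, key=positive.get, reverse=True)


hits = [
    ("finance", "HIGH"),
    ("finance", "HIGH"),
    ("finance", "DISCARD"),  # outweighs the two HIGH hits
    ("travel", "SELECT"),    # pushes the domain to the top
    ("sports", "HIGH"),
]
print(output_categories(hits))  # ['travel', 'sports']
```

Here `finance` is dropped despite its two HIGH hits, and `travel` outranks `sports` on the strength of a single SELECT hit.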
Custom options
It is possible to create custom score options. They are defined in the config.cr file.
The syntax is:
SCORES
{
@scoreOptionName:points,
...
}
For example:
SCORES
{
@LOWER:1,
@HIGHER:20
}
Once defined, the names of the new options can be used in the categorization rules, allowing a wider range of rule scores.
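As a rough illustration of how custom options extend the standard ones, the following sketch reads a SCORES block written in the syntax above and merges it into a lookup table. The helper `parse_scores_block` is hypothetical and is not part of any product tooling; storing the names without the leading `@` is a choice made for this example only.

```python
import re

# Points of the first three standard options, as listed in the table.
STANDARD_POINTS = {"NORMAL": 10, "LOW": 3, "HIGH": 15}


def parse_scores_block(text):
    """Extract @name:points pairs from the text of a SCORES { ... } block."""
    pairs = re.findall(r"@(\w+)\s*:\s*(-?\d+)", text)
    return {name: int(points) for name, points in pairs}


config = """
SCORES
{
    @LOWER:1,
    @HIGHER:20
}
"""

custom = parse_scores_block(config)
all_points = {**STANDARD_POINTS, **custom}  # custom options sit next to the standard ones

print(custom)                # {'LOWER': 1, 'HIGHER': 20}
print(all_points["HIGHER"])  # 20
```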
[1] With the exception of the DISCARD score option, which corresponds to a large negative amount.