The scoring mechanism
The basic algorithm
The basic algorithm determines how many points a categorization rule generates, that is, the amount by which the score of the rule's domain is increased each time the rule is activated.
- For each operand of the rule's condition:
  - The number of matched tokens is computed, considering that a single KEYWORD or PATTERN attribute can match two or more tokens.
  - The number of tokens is multiplied by the amount associated with the score option (NORMAL: 10, HIGH: 15, etc.).
- The amounts computed for all the operands are summed.
- If the section in which the rule's condition was met has a score multiplication factor, the amount of points is multiplied by that factor.
- If the rule's condition was met inside a segment and that segment has a score multiplication factor, the amount of points is multiplied by that factor.
A more complex algorithm is used in the case of conditions containing positional sequence operators.
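The steps of the basic algorithm can be sketched in Python as follows. This is only an illustration of the arithmetic, not the engine's implementation: the function name, its parameters and the SCORE_AMOUNTS table are hypothetical, and only the NORMAL and HIGH amounts are taken from the text.

```python
# Amounts quoted in the text; other score options exist but are not listed there.
SCORE_AMOUNTS = {"NORMAL": 10, "HIGH": 15}


def basic_score(matched_tokens_per_operand, score_option="NORMAL",
                section_factor=1, segment_factor=1):
    """Points for one rule activation (hypothetical helper).

    matched_tokens_per_operand: one integer per operand, the number of
    tokens that operand matched.
    """
    amount = SCORE_AMOUNTS[score_option]
    # Per-operand amount (tokens * option amount), summed over all operands.
    points = sum(tokens * amount for tokens in matched_tokens_per_operand)
    # Optional multiplication factors of the section and of the segment.
    return points * section_factor * segment_factor
```

For instance, a single operand matching two tokens under NORMAL gives 20 points, and the same match inside a section with a factor of 2 gives 40.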
KEYWORD attribute contribution
A single KEYWORD attribute generates an amount of points that's proportional to the number of matched tokens, and this number, in turn, is influenced by the number of lemmas.
For example, in this text:
This credit card can help you establish a positive credit history.
the disambiguator recognizes credit card and credit history as multi-word lemmas.
This rule:
SCOPE SENTENCE
{
DOMAIN(dom1:NORMAL)
{
KEYWORD("credit card")
}
}
is activated by the sample text and generates a NORMAL amount of points (10). This is because credit card corresponds to one token in the disambiguation output.
On the other hand, this rule:
SCOPE SENTENCE
{
DOMAIN(dom1:NORMAL)
{
KEYWORD("can help")
}
}
generates the NORMAL amount of points twice (10 * 2 = 20), because can and help correspond to two lemmas and hence to two tokens.
If the rule is changed to:
SCOPE SENTENCE
{
DOMAIN(dom1:NORMAL)
{
KEYWORD("credit card can help")
}
}
the amount of points generated by the KEYWORD attribute will be 30: 20 due to can and help, plus 10 due to credit card.
If more expressions are specified within the same attribute, as in:
SCOPE SENTENCE
{
DOMAIN(dom1:NORMAL)
{
KEYWORD("help", "can")
}
}
the amount of points generated by each rule activation depends on the expression that matched.
In the case of the sample text, the rule is activated twice, the first time because of can and the second because of help. Each activation generates 10 points. The domain therefore receives 20 points in total, but as the result of two distinct hits rather than of a single activation, as was the case for the KEYWORD("can help") rule.
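The way multi-word lemmas affect the KEYWORD score can be illustrated with a small Python sketch. It assumes, as a simplification, that the set of multi-word lemmas recognized by the disambiguator is known in advance and that the longest lemma always wins; count_tokens and keyword_score are hypothetical names, and real disambiguation depends on context.

```python
def count_tokens(expression, multiword_lemmas):
    """Count tokens in an expression, treating known multi-word lemmas as one token."""
    words = expression.split()
    tokens = 0
    i = 0
    while i < len(words):
        # Try the longest multi-word lemma starting at position i.
        for length in range(len(words) - i, 1, -1):
            if " ".join(words[i:i + length]) in multiword_lemmas:
                i += length
                break
        else:
            # No multi-word lemma starts here: the word is a token on its own.
            i += 1
        tokens += 1
    return tokens


def keyword_score(expression, multiword_lemmas, amount=10):
    """Points generated by one KEYWORD match (amount 10 corresponds to NORMAL)."""
    return count_tokens(expression, multiword_lemmas) * amount


# Multi-word lemmas recognized in the sample sentence, per the text.
LEMMAS = {"credit card", "credit history"}
```

With these assumptions, keyword_score("credit card", LEMMAS) is 10, keyword_score("can help", LEMMAS) is 20 and keyword_score("credit card can help", LEMMAS) is 30, matching the three sample rules.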
Boolean combinations
A condition made of two or more operands combined with Boolean operators (AND, OR, AND NOT, XOR) generates an amount of points proportional to the number of the operands that match.
The following rule, for example, always generates a score of 20, 10 for each operand:
SCOPE SENTENCE
{
DOMAIN(dom1:NORMAL)
{
LEMMA("tennis")
AND
LEMMA("match")
}
}
On the other hand, this rule:
SCOPE SENTENCE
{
DOMAIN(dom1:NORMAL)
{
LEMMA("tennis")
OR
LEMMA("match")
}
}
generates a variable amount of points: 10 points if only one of the two operands finds a match, 20 points when both operands match.
Complex Boolean combinations are deconstructed into simpler expressions.
For example, this rule:
SCOPE SENTENCE
{
DOMAIN(dom1:NORMAL)
{
LEMMA("tennis")
AND
(
LEMMA("match")
OR
LEMMA("series")
)
}
}
has an inner OR combination of two operands that can generate 10 or 20 points, depending on whether only one operand or both operands match.
The first operand always generates 10 points, so the total amount of points generated by the rule can be 10 + 10 = 20 or 10 + 20 = 30.
Any operand or sub-condition that follows the AND NOT operator does not generate any points.
Therefore, every hit from the following rule generates 20 points: each of the two operands combined with AND generates 10 points, while the third operand does not generate any points.
SCOPE SENTENCE
{
DOMAIN(dom1:NORMAL)
{
LEMMA("tennis")
AND
LEMMA("match")
AND NOT
LEMMA("terrorism")
}
}
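The Boolean cases above can be summarized with a short recursive Python sketch. The representation of a condition as nested tuples and the name boolean_score are assumptions made for illustration; the sketch also assumes that every operand matches a single token and that the rule as a whole has already been activated.

```python
def boolean_score(node, matched, amount=10):
    """Points generated by a Boolean condition for one hit.

    node: an operand name (str) or a tuple (operator, left, right).
    matched: the set of operand names that found a match.
    """
    if isinstance(node, str):
        # A plain operand contributes only when it matches.
        return amount if node in matched else 0
    operator, left, right = node
    if operator == "AND NOT":
        # Whatever follows AND NOT never generates points.
        return boolean_score(left, matched, amount)
    # AND, OR, XOR: sum the contributions of the operands that match.
    return (boolean_score(left, matched, amount)
            + boolean_score(right, matched, amount))


nested = ("AND", "tennis", ("OR", "match", "series"))
negated = ("AND NOT", ("AND", "tennis", "match"), "terrorism")
```

Here boolean_score(nested, {"tennis", "match"}) gives 20, boolean_score(nested, {"tennis", "match", "series"}) gives 30, and boolean_score(negated, {"tennis", "match"}) gives 20.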
Complex condition with positional or logical operators
A complex condition, that is, one containing more than one operand joined by positional or logical operators, generates a score based on the following variant of the basic algorithm:
- Positional and logical operators become part of the operands.
- A condition built with these operators generates the square of the number of operands on both sides of the operators, multiplied by the score defined in the header of the rule.
For example, this rule generates 40 points:
SCOPE SENTENCE
{
// Strict sequence of two one-token operands
DOMAIN(dom1:NORMAL)
{
LEMMA("tennis")
>>
LEMMA("match")
}
}
The points are computed as follows:
- Number of operands on both sides of the positional or logical operator: 2.
- Square of the number of operands: 2 * 2 = 4.
- Square of the number of operands multiplied by the score defined in the header of the rule: 4 * 10 = 40.
A rule constructed as follows:
SCOPE SENTENCE
{
// Flexible sequence of three one-token operands
DOMAIN(dom1:HIGH)
{
LEMMA("member")
<>
KEYWORD("of")
<>
LEMMA("House of Commons")
}
}
has three operands, each of which matches one token, joined by the flexible sequence operator. The score option is set to HIGH, therefore this rule generates 135 points. Here is the breakdown:
- Number of operands on both sides of the positional or logical operators: 3.
- Square of the number of operands: 3 * 3 = 9.
- Square of the number of operands multiplied by the score defined in the header of the rule: 9 * 15 = 135.
Complex combinations are decomposed into simpler expressions. Consider, for example, this rule:
SCOPE SENTENCE
{
// Combination of flexible sequence and Boolean operator
DOMAIN(dom1:NORMAL)
{
LEMMA("tennis")
<>
LEMMA("match")
AND
LEMMA("player")
}
}
Every instance of this rule generates 50 points:
- Number of operands on both sides of the positional or logical operator: 2.
- Square of the number of operands: 2 * 2 = 4.
- Square of the number of operands multiplied by the score defined in the header of the rule: 4 * 10 = 40.
- Score generated by the operand that is not joined by a positional or logical operator: 10.
- Sum of the scores: 40 + 10 = 50.
If two tokens disambiguated as one lemma are matched by two different operands, each operand adds its individual value to the final score of the rule, regardless of the disambiguation.
The following rule, for example, generates 40 points, because its two operands are joined by a positional operator:
SCOPE SENTENCE
{
// Strict sequence of two words of the same lemma.
DOMAIN(dom1:NORMAL)
{
KEYWORD("credit")
>>
KEYWORD("card")
}
}
Operands following the exclamation mark do not contribute to points generation.
The following rule, for example, generates 40 points:
SCOPE SENTENCE
{
//Exclamation mark does not contribute to the score
DOMAIN(dom1:NORMAL)
{
LEMMA("tennis")
<>
LEMMA("match")
>>
!LEMMA("parliament")
}
}
Breakdown:
- Number of operands on both sides of a positional or logical operator, not counting the operand preceded by the exclamation mark: 2.
- Square of the number of operands: 2 * 2 = 4.
- Square of the number of operands multiplied by the score defined in the header of the rule: 4 * 10 = 40.
- Score generated by the operand following the exclamation mark: 0.
- Sum of the two scores: 40 + 0 = 40.
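The breakdowns in this section all follow the same arithmetic, which can be sketched in Python as below. The function names and the way a rule is described (a list of sequence operands with a negation flag, plus a count of operands outside the sequence) are hypothetical and serve only to reproduce the figures given in the examples.

```python
def sequence_score(operands, amount=10):
    """Score of operands joined by positional or logical operators.

    operands: list of (name, negated) pairs; operands preceded by the
    exclamation mark (negated=True) do not contribute.
    """
    counted = sum(1 for _name, negated in operands if not negated)
    # Square of the number of counted operands times the header score.
    return counted ** 2 * amount


def rule_score(sequence_operands, other_operands=0, amount=10):
    """Sequence score plus one header score per operand outside the sequence."""
    return sequence_score(sequence_operands, amount) + other_operands * amount


strict = [("tennis", False), ("match", False)]
flexible_high = [("member", False), ("of", False), ("House of Commons", False)]
with_negation = [("tennis", False), ("match", False), ("parliament", True)]
```

With these definitions, rule_score(strict) is 40, rule_score(flexible_high, amount=15) is 135, rule_score(strict, other_operands=1) is 50 and rule_score(with_negation) is 40.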
Options
You can set the following options to change how the scoring mechanism works.
FIXED_SCORE
When this option is set, step one of the algorithm is replaced by:
- The amount of points is set equal to the amount associated with the rule's score option.
In other words, the amount of points before the application of the multiplication factors depends only on the rule's score option (default: NORMAL). The number of operands and the number of matched tokens no longer have any impact.
STATIC_SCORE
When this option is set, step one of the algorithm is replaced by:
- Each operand contributes a fixed amount equal to the amount associated with the rule's score option.
In other words, the number of matched tokens is no longer considered. Each operand's contribution to the score does not change according to the length of the matched text; it is always equal to the nominal score declared for the rule (default: NORMAL).
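The difference between the default behavior and the two options can be sketched in Python as follows. The function name and its signature are hypothetical; the sketch covers only the amount computed before the section and segment multiplication factors are applied.

```python
def score_before_factors(matched_tokens_per_operand, amount=10, option=None):
    """Points before the multiplication factors (amount 10 corresponds to NORMAL)."""
    if option == "FIXED_SCORE":
        # Only the rule's score option counts.
        return amount
    if option == "STATIC_SCORE":
        # Each operand contributes the nominal score, whatever its length.
        return amount * len(matched_tokens_per_operand)
    # Default: each operand contributes once per matched token.
    return amount * sum(matched_tokens_per_operand)
```

For a condition with two operands matching two tokens and one token respectively, the default gives 30 points, STATIC_SCORE gives 20 and FIXED_SCORE gives 10.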
CHILD_TO_FATHER
When this option is set, if a domain has a parent in the taxonomy, the domain's score will be added to its parent's score. In this way, a parent domain will receive the score of all its children.
This is called propagation of the score from children to fathers. It is a recursive mechanism: a father that is itself the child of a higher domain transmits its score (including any score inherited from its own children) to its respective father.
This option can be useful when some high-level domains with no children have many rules, while other high-level domains have no rules of their own but have children with a few rules each.
Based on categorization rules only, high-level domains with no children would probably get the better of the "weaker" subdomains. With this option set, domains with descendants become contenders.
For example, suppose that for this taxonomy:
environment
    climate change
        global warming
    conservation
        energy saving
        parks
science and technology
a few rules are defined for the global warming, energy saving and parks domains, and considerably more rules are defined for the first-level domain science and technology.
If a text is about technologies with reduced environmental impact, the scores after the application of the rules could be:
Domain | Score |
---|---|
environment | 0 |
climate change | 0 |
global warming | 10 |
conservation | 0 |
energy saving | 10 |
parks | 10 |
science and technology | 20 |
In this case, science and technology would win, even if several sub-topics of environment were detected.
However, by activating the propagation of the scores from children to fathers, the situation would change as follows:
Domain | Score |
---|---|
environment | 30 (climate change score [10] + conservation score [20]) |
climate change | 10 (global warming score [10]) |
global warming | 10 |
conservation | 20 (energy saving score [10] + parks score [10]) |
energy saving | 10 |
parks | 10 |
science and technology | 20 |
Therefore, the environment domain would be the winner.
As can be seen, the score of the child domains was propagated to their respective fathers, and the score of the fathers was in turn propagated to the first-level domain, which thus accumulated all the points of its descendants.
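The propagation in this example can be reproduced with a short recursive Python sketch. The dictionaries and the function name are hypothetical; they only encode the sample taxonomy and the scores generated by the rules in the first table.

```python
# Sample taxonomy: each father mapped to its direct children.
TAXONOMY = {
    "environment": ["climate change", "conservation"],
    "climate change": ["global warming"],
    "conservation": ["energy saving", "parks"],
}

# Scores generated by the rules alone, before propagation.
RULE_SCORES = {
    "global warming": 10,
    "energy saving": 10,
    "parks": 10,
    "science and technology": 20,
}


def propagated_score(domain, taxonomy=TAXONOMY, rule_scores=RULE_SCORES):
    """Own rule score plus the propagated scores of all children."""
    own = rule_scores.get(domain, 0)
    children = taxonomy.get(domain, [])
    return own + sum(propagated_score(child, taxonomy, rule_scores)
                     for child in children)
```

Calling propagated_score("environment") returns 30, which exceeds the 20 points of science and technology.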
The expert.ai text intelligence engine also offers a second score, called compound, which can be useful in these cases. In the default setting, its value "copies" that of the standard score, so there's no reason to use it. However, when the CHILD_TO_FATHER option is set, this additional score is calculated in a way that amplifies the effects of the propagation from children to fathers.
With the option set, the first step illustrated above is followed by a second step that affects only the compound score. In this step, starting from the child domains, the propagation is repeated while taking into account the scores computed in the first step. Using the above example, the conservation domain would now be assigned 40 points: the 20 points of its children plus its own 20 points calculated in the previous step.
Since the standard score does not change, the final values for the example would be:
Domain | Score | Compound |
---|---|---|
environment | 30 | 90 (domain own score [30] + climate change score [20] + conservation score [40]) |
climate change | 10 | 20 (domain own score [10] + global warming score [10]) |
global warming | 10 | |
conservation | 20 | 40 (domain own score [20] + energy saving score [10] + parks score [10]) |
energy saving | 10 | |
parks | 10 | |
science and technology | 20 | |
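The compound values in the table are consistent with the following Python sketch, in which a domain's compound score is its standard (propagated) score plus the compound scores of its children. This is an interpretation inferred from the example figures, not a documented formula. All names are hypothetical, and for domains without children the sketch assumes that the compound score simply equals the standard score, a case the table leaves blank.

```python
TAXONOMY = {
    "environment": ["climate change", "conservation"],
    "climate change": ["global warming"],
    "conservation": ["energy saving", "parks"],
}
RULE_SCORES = {
    "global warming": 10,
    "energy saving": 10,
    "parks": 10,
    "science and technology": 20,
}


def standard_score(domain):
    """First step: own rule score plus the standard scores of the children."""
    return RULE_SCORES.get(domain, 0) + sum(
        standard_score(child) for child in TAXONOMY.get(domain, []))


def compound_score(domain):
    """Second step: standard score plus the compound scores of the children."""
    children = TAXONOMY.get(domain, [])
    if not children:
        # Assumption: without children, compound copies the standard score.
        return standard_score(domain)
    return standard_score(domain) + sum(
        compound_score(child) for child in children)
```

Under these assumptions, compound_score("conservation") is 40, compound_score("climate change") is 20 and compound_score("environment") is 90, as in the table.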