The scoring mechanism

The basic algorithm

The basic algorithm determines the amount of points generated by a categorization rule, that is, the amount of points added to the rule's domain each time the rule is activated.

  1. For each operand of the rule's condition:
    1. The number of matched tokens is computed, considering that a single KEYWORD or PATTERN attribute can match two or more tokens.
    2. The number of tokens is multiplied by the amount associated with the score option (NORMAL: 10, HIGH: 15, etc.).
  2. The amounts computed for all the operands are summed.
  3. If the section in which the rule's condition was met has a score multiplication factor, the amount of points will be multiplied by that factor.
  4. If the rule's condition was met inside a segment and that segment has a score multiplication factor, the amount of points will be multiplied by that factor.

A more complex algorithm is used in the case of conditions containing positional sequence operators.
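
The steps of the basic algorithm can be sketched in Python as follows. This is only an illustration, not the engine's implementation: the function name, its parameters, and the way multiplication factors are passed in are assumptions made for the example. Only the NORMAL and HIGH amounts are taken from the text above.

```python
# Amounts named in the text; other score options exist but are not listed here.
SCORE_AMOUNTS = {"NORMAL": 10, "HIGH": 15}


def rule_points(matched_tokens_per_operand, score_option="NORMAL",
                section_factor=1, segment_factor=1):
    """Points for one rule activation under the basic algorithm (illustrative)."""
    amount = SCORE_AMOUNTS[score_option]
    # Steps 1 and 2: tokens matched by each operand times the amount, summed.
    points = sum(tokens * amount for tokens in matched_tokens_per_operand)
    # Steps 3 and 4: optional section and segment multiplication factors.
    return points * section_factor * segment_factor
```

For instance, `rule_points([1, 1], "HIGH", section_factor=2)` models a HIGH rule with two one-token operands matched in a section whose factor is 2.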

KEYWORD attribute contribution

A single KEYWORD attribute generates an amount of points that's proportional to the number of matched tokens, and this number, in turn, is influenced by the number of lemmas.
For example, in this text:

This credit card can help you establish a positive credit history.

the disambiguator recognizes credit card and credit history as multi-word lemmas.
This rule:

SCOPE SENTENCE
{
    DOMAIN(dom1:NORMAL)
    {
        KEYWORD("credit card")
    }
}

is activated by the sample text and generates a NORMAL amount of points (10). This is because credit card corresponds to one token in the disambiguation output.

On the other hand, this rule:

SCOPE SENTENCE
{
    DOMAIN(dom1:NORMAL)
    {
        KEYWORD("can help")
    }
}

generates the NORMAL amount of points twice (10 × 2 = 20), because can and help correspond to two lemmas and hence to two tokens.

If the rule is changed to:

SCOPE SENTENCE
{
    DOMAIN(dom1:NORMAL)
    {
        KEYWORD("credit card can help")
    }
}

the amount of points generated by the KEYWORD attribute will be 30: 20 for can and help plus 10 for credit card.

If more expressions are specified within the same attribute, as in:

SCOPE SENTENCE
{
    DOMAIN(dom1:NORMAL)
    {
        KEYWORD("help", "can")
    }
}

the amount of points generated by each rule activation depends on the expression that matched.
In the case of the sample text, the rule is activated twice: the first time by can and the second time by help. Each activation generates 10 points. The domain therefore receives 20 points from two distinct hits, not from a single activation as in the first sample rule.
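
The token counting behind these figures can be illustrated with a small Python sketch. The token list below is a hand-written stand-in for the disambiguation output of the sample sentence, and `keyword_points` is a hypothetical helper. It only handles expressions that line up with whole tokens, which is a simplification of how KEYWORD actually matches.

```python
# Hypothetical tokens for the sample sentence: multi-word lemmas are single tokens.
TOKENS = ["this", "credit card", "can", "help", "you", "establish",
          "a", "positive", "credit history"]
NORMAL = 10


def keyword_points(expression, tokens=TOKENS, amount=NORMAL):
    """Points for one match of a KEYWORD expression: amount times tokens covered."""
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            if " ".join(tokens[start:end]) == expression:
                return (end - start) * amount
    return 0  # the expression does not line up with whole tokens in this sketch
```

With this token list, "credit card" covers one token, "can help" covers two, and "credit card can help" covers three.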

Boolean combinations

A condition made of two or more operands combined with Boolean operators (AND, OR, AND NOT, XOR) generates an amount of points proportional to the number of the operands that match.

The following rule, for example, always generates a score of 20, that is, 10 for each operand:

SCOPE SENTENCE
{
    DOMAIN(dom1:NORMAL)
    {
        LEMMA("tennis")
        AND
        LEMMA("match")
    }
}

On the other hand, this rule:

SCOPE SENTENCE
{
    DOMAIN(dom1:NORMAL)
    {
        LEMMA("tennis")
        OR
        LEMMA("match")
    }
}

generates a variable amount of points: 10 points if only one of the two operands finds a match, 20 points when both operands match.

Complex Boolean combinations are deconstructed into simpler expressions.
For example, this rule:

SCOPE SENTENCE
{
    DOMAIN(dom1:NORMAL)
    {
        LEMMA("tennis")
        AND
        (
            LEMMA("match")
            OR
            LEMMA("series")
        )
    }
}

has an inner OR combination of two operands that can generate 10 or 20 points based on the number of operands that match, which can be only one or both.
The first operand always generates 10 points, so the total amount of points generated by the rule can be 10 + 10 = 20 or 10 + 20 = 30.

Any operand or sub-condition which follows the AND NOT operator will not generate any points.
Therefore, every hit from the following rule will generate 20 points, as each of the two operands combined with AND generates 10 points while the third operand does not generate any points.

SCOPE SENTENCE
{
    DOMAIN(dom1:NORMAL)
    {
        LEMMA("tennis")
        AND
        LEMMA("match")
        AND NOT
        LEMMA("terrorism")
    }
}
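
The Boolean cases above can be summarized with a short Python sketch. It assumes the condition has already been satisfied and only adds up the contributions. The pair representation and the `boolean_points` name are illustrative assumptions, not part of the rules language.

```python
NORMAL = 10


def boolean_points(operands, amount=NORMAL):
    """Sum the contributions of the operands of a satisfied Boolean condition.

    operands: list of (matched_tokens, excluded) pairs, one per operand.
    excluded=True marks an operand that follows AND NOT and so scores nothing.
    """
    return sum(tokens * amount for tokens, excluded in operands if not excluded)
```

For example, `boolean_points([(1, False), (1, False), (0, True)])` corresponds to the tennis AND match AND NOT terrorism rule, while `boolean_points([(1, False), (0, False)])` corresponds to the OR rule when only one operand matches.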

Complex condition with positional or logical operators

A complex condition, that is, one containing more than one operand joined by positional or logical operators, generates a score based on the following variant of the basic algorithm:

  • Positional and logical operators become part of the operands.
  • A condition built with these operators generates the square of the number of operands joined by the operators, multiplied by the score defined in the header of the rule.

For example, this rule generates 40 points:

SCOPE SENTENCE
{
    // Strict sequence of two one-token operands
    DOMAIN(dom1:NORMAL)
    {
        LEMMA("tennis")
        >>
        LEMMA("match")
    }
}

The points are computed as follows:

  • Number of operands present on both sides of a positional or logical operator: 2.
  • Square of the number of operands on both sides of the positional or logical operator in the rule: 2 * 2 = 4.
  • Square of the number of operands multiplied by the score defined in the header of the rule: 4 * 10 = 40.

A rule constructed as follows:

SCOPE SENTENCE
{
    // Flexible sequence of three one-token operands
    DOMAIN(dom1:HIGH)
    {
        LEMMA("member")
        <>
        KEYWORD("of")
        <>
        LEMMA("House of Commons")
    }
}

has three operands, each of which matches one token, joined by the flexible sequence operator. The score option is set to HIGH, so this rule generates 135 points. Here is the breakdown:

  • Number of operands present on both sides of a positional or logical operator: 3.
  • Square of the number of operands present on both sides of the positional or logical operator in the rule: 3 * 3 = 9.
  • Square of the number of operands multiplied by the score defined in the header of the rule: 9 * 15 = 135.
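
The arithmetic of the two breakdowns above can be checked with a minimal Python sketch. The function name is hypothetical, and only the NORMAL and HIGH amounts stated earlier are included.

```python
# Amounts named in the text; other score options are not listed here.
SCORE_AMOUNTS = {"NORMAL": 10, "HIGH": 15}


def sequence_points(operand_count, score_option="NORMAL"):
    """Square of the number of operands in the sequence times the rule's amount."""
    return operand_count ** 2 * SCORE_AMOUNTS[score_option]
```

`sequence_points(2)` reproduces the strict sequence example and `sequence_points(3, "HIGH")` reproduces the flexible sequence example.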

Complex combinations are decomposed in simpler expressions. Consider for example this rule:

SCOPE SENTENCE
{
    // Combination of flexible sequence and Boolean operator
    DOMAIN(dom1:NORMAL)
    {
        LEMMA("tennis")
        <>
        LEMMA("match")
        AND NOT
        LEMMA("terrorism")
    }
}

Every instance of this rule generates 50 points:

  • Number of operands present on both sides of a positional or logical operator: 2.
  • Square of the number of operands present on both sides of the positional or logical operator in the rule: 2 * 2 = 4.
  • Square of the number of operands multiplied by the score defined in the header of the rule: 4 * 10 = 40.
  • Score generated by operand, not included by a positional or logical operator: 10
  • Sum of the score of all the operands: 40 + 10 = 50.

If two tokens disambiguated as one lemma are matched by two different operands, each operand adds its individual value to the final score of the rule, regardless of the disambiguation.
The following rule, for example, generates 40 points, because its two operands are joined by a positional operator.

SCOPE SENTENCE
{
    // Strict sequence of two words of the same lemma.
    DOMAIN(dom1:NORMAL)
    {
        KEYWORD("credit")
        >>
        KEYWORD("card")
    }
}

Operands following the exclamation mark do not contribute to points generation.
The following rule, for example, generates 40 points:

SCOPE SENTENCE
{
    //Exclamation mark does not contribute to the score
    DOMAIN(dom1:NORMAL)
    {
        LEMMA("tennis")
        <>
        LEMMA("match")
        >>
        !LEMMA("parliament")
    }
}

Breakdown:

  • Number of operands present on both sides of a positional or logical operator: 2.
  • Square of the number of operands present on both sides of the positional or logical operator in the rule: 2 * 2 = 4.
  • Square of the number of operands multiplied by the score defined in the header of the rule: 4 * 10 = 40.
  • Score generated by the operand following the exclamation mark: 0.
  • Sum of the scores of all operands: 40 + 0 = 40.
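
As an illustration of this breakdown, the following Python sketch leaves negated operands out of the count before squaring. The list-of-flags representation and the function name are assumptions made for the example.

```python
NORMAL = 10


def sequence_points_ignoring_negated(negated_flags, amount=NORMAL):
    """negated_flags: one boolean per operand, True if it follows an exclamation mark.

    Only non-negated operands are counted before squaring.
    """
    counted = sum(1 for negated in negated_flags if not negated)
    return counted ** 2 * amount
```

The rule above has three operands, the last of which is negated, so it is modeled as `sequence_points_ignoring_negated([False, False, True])`.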

Options

You can set scoring options to change how the scoring mechanism works.

FIXED_SCORE

When this option is set, step one of the algorithm is replaced by:

  1. The amount of points is set equal to the amount associated with the rule's score option.

In other words, the amount of points before the application of the multiplication factors depends only on the rule's score option (default: NORMAL); the number of operands and the number of matched tokens no longer have any impact.

STATIC_SCORE

When this option is set, step one of the algorithm is replaced by:

  1. Each operand contributes a fixed amount equal to the amount associated with the rule's score option.

In other words, the number of matched tokens is no longer considered. Each operand's contribution to the score does not change with the length of the matched text, but is always equal to the nominal score declared for the rule (default: NORMAL).
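
The difference between the default behavior, FIXED_SCORE and STATIC_SCORE can be compared side by side in a Python sketch. The `mode` parameter is a hypothetical way of selecting the option, and the sketch assumes that an operand with zero matched tokens did not match and therefore contributes nothing under STATIC_SCORE.

```python
# Amounts named in the text; other score options are not listed here.
SCORE_AMOUNTS = {"NORMAL": 10, "HIGH": 15}


def points_before_factors(matched_tokens_per_operand, score_option="NORMAL", mode=None):
    """Step one of the algorithm under the default, FIXED_SCORE or STATIC_SCORE."""
    amount = SCORE_AMOUNTS[score_option]
    if mode == "FIXED_SCORE":
        # Only the score option matters.
        return amount
    if mode == "STATIC_SCORE":
        # Assumption: each operand that matched counts once, whatever its length.
        matched = sum(1 for tokens in matched_tokens_per_operand if tokens > 0)
        return amount * matched
    # Default: every matched token counts.
    return amount * sum(matched_tokens_per_operand)
```

For a NORMAL rule whose two operands match two tokens and one token respectively, the three modes give 30 (default), 20 (STATIC_SCORE) and 10 (FIXED_SCORE).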

CHILD_TO_FATHER

When this option is set, if a domain has a parent in the taxonomy, the domain's score will be added to its parent's score. In this way, a parent domain will receive the score of all its children.

This is called propagation of score from children to fathers. It is a recursive mechanism: even a father that is itself the child of a higher domain transmits its score (possibly inherited from its children) to its own father.

This option can be useful when some high-level domains with no children have many rules, while other high-level domains have no rules of their own but have children with a few rules each.
Based on categorization rules only, high-level domains with no children would probably prevail over the "weaker" subdomains. With this option set, domains with descendants become contenders.

For example, suppose that for this taxonomy:

environment
    climate change
        global warming
    conservation
        energy saving
        parks
science and technology

some rules are defined for the global warming, energy saving and parks domains and relatively many more rules are defined for the first level domain science and technology.
If a text is about technologies with reduced environmental impact, the possible score after the application of the rules could be:

Domain                      Score
environment                 0
    climate change          0
        global warming      10
    conservation            0
        energy saving       10
        parks               10
science and technology      20

In this case, science and technology would win, even if several sub-topics of environment were detected.
However, by activating the propagation of the scores from children to fathers, the situation would change as follows:

Domain                      Score
environment                 30 (climate change score [10] + conservation score [20])
    climate change          10 (global warming score [10])
        global warming      10
    conservation            20 (energy saving score [10] + parks score [10])
        energy saving       10
        parks               10
science and technology      20

Therefore, the environment domain would be the winner.
As can be seen, the score of the children domains was propagated to the respective fathers and the score of fathers was in turn propagated to the first level domain, which thus accumulated all the points of its descendants.
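
The recursive propagation in this example can be reproduced with a short Python sketch. The dictionaries encode the sample taxonomy and the scores obtained from the rules alone. The data structure and function name are illustrative assumptions.

```python
# Sample taxonomy: each domain maps to its direct children.
CHILDREN = {
    "environment": ["climate change", "conservation"],
    "climate change": ["global warming"],
    "conservation": ["energy saving", "parks"],
    "global warming": [],
    "energy saving": [],
    "parks": [],
    "science and technology": [],
}
# Scores produced by the rules alone, before propagation.
OWN_SCORE = {"global warming": 10, "energy saving": 10, "parks": 10,
             "science and technology": 20}


def propagated_score(domain):
    """Own score plus the propagated scores of all children, recursively."""
    return OWN_SCORE.get(domain, 0) + sum(
        propagated_score(child) for child in CHILDREN[domain])
```

Calling `propagated_score("environment")` walks the whole subtree and returns the 30 points shown above.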

The expert.ai text intelligence engine also offers a second score, called compound, which can be useful in these cases. In the default setting, its value "copies" that of the standard score, so there's no reason to use it. However, when the CHILD_TO_FATHER option is set, this additional score is calculated in a way that amplifies the effects of the propagation from children to fathers.
With the option set, after the first step illustrated above, a second step is performed, which affects only the compound score. In this step, starting from the children domains, the propagation is repeated while taking into account the pre-existing score. Using the above example, the conservation domain would now be assigned 40 points: the 20 points of its children plus its own 20-point score calculated in the previous step.
Keeping in mind that the standard score has not changed, the final compound values for the example would be:

Domain                      Score   Compound
environment                 30      90 (domain own score [30] + climate change score [20] + conservation score [40])
    climate change          10      20 (domain own score [10] + global warming score [10])
        global warming      10
    conservation            20      40 (domain own score [20] + energy saving score [10] + parks score [10])
        energy saving       10
        parks               10
science and technology      20
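
The compound values for the parent domains can be reproduced with the Python sketch below. It starts from the standard scores after propagation and assumes that a domain with no children keeps its standard score as its compound value. That assumption matches the arithmetic of the example, but it is not stated explicitly here.

```python
# Sample taxonomy: each domain maps to its direct children.
CHILDREN = {
    "environment": ["climate change", "conservation"],
    "climate change": ["global warming"],
    "conservation": ["energy saving", "parks"],
    "global warming": [],
    "energy saving": [],
    "parks": [],
    "science and technology": [],
}
# Standard scores after the first propagation step.
STANDARD = {"environment": 30, "climate change": 10, "global warming": 10,
            "conservation": 20, "energy saving": 10, "parks": 10,
            "science and technology": 20}


def compound_score(domain):
    """Standard score plus the compound scores of all children (assumed rule)."""
    return STANDARD[domain] + sum(
        compound_score(child) for child in CHILDREN[domain])
```

Under this assumption, `compound_score("environment")` adds the domain's own 30 points to the 20 and 40 points of climate change and conservation.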