The scoring mechanism

The engine first computes a confidence score for every instance, then computes the confidence score of each field as the average of its instances' scores.

For example, with this rule:

SCOPE SENTENCE
{
    IDENTIFY(PERSONAL_DATA)
    {
        @FULL_NAME[TYPE(NPH)]
    }
}

applied to this text:

John Smith is the CEO of Acme Ltd, founded in 1985. To this day, John lives in New York. He has 3 sons.

you will get these instances:

Instance text    Confidence score
John Smith       1.00
John             1.00
He               1.00

and this field:

Field        Value         Confidence score
FULL_NAME    John Smith    1.00

In this other case, with this rule:

SCOPE SENTENCE
{
    IDENTIFY(PERSONAL_DATA:LOW)
    {
        @FULL_NAME[TYPE(NPH)]
    }
}

applied to the same text above, you will get these instances:

Instance text    Confidence score
John Smith       0.25
John             0.25
He               0.25

and this field:

Field        Value         Confidence score
FULL_NAME    John Smith    0.25

All scores are 0.25 because the LOW score option corresponds to 25% of the default score (1.00).
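The two examples above can be reproduced with a minimal Python sketch. The names DEFAULT_SCORE, SCORE_OPTIONS, instance_score and field_score are hypothetical, chosen for illustration only; they are not part of the rules language or of the engine. The sketch assumes that a rule without a score option assigns the default score and that a field's score is the plain arithmetic mean of its instances' scores.

```python
DEFAULT_SCORE = 1.00

# Score options as percentages of the default score.
# Only LOW (25%) has been introduced in the text so far.
SCORE_OPTIONS = {"LOW": 25}


def instance_score(option=None, options=SCORE_OPTIONS):
    """Score assigned to an instance by a single rule (illustrative only)."""
    if option is None:
        # Assumption: no score option means the default score.
        return DEFAULT_SCORE
    return DEFAULT_SCORE * options[option] / 100


def field_score(instance_scores):
    """Field confidence as the average of its instances' scores."""
    return sum(instance_scores) / len(instance_scores)


instances = ["John Smith", "John", "He"]

# First rule: IDENTIFY(PERSONAL_DATA), no score option.
default_scores = [instance_score() for _ in instances]
print(f"{field_score(default_scores):.2f}")  # 1.00

# Second rule: IDENTIFY(PERSONAL_DATA:LOW).
low_scores = [instance_score("LOW") for _ in instances]
print(f"{field_score(low_scores):.2f}")  # 0.25
```

Because every instance is extracted by the same single rule, all instances share one score and the field average equals that score.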

In more complex cases, an extraction can be determined by multiple rules with different score options, for example:

CONFIDENCE
{
    @VERYHIGH:80
}

SCOPE SENTENCE
{
    IDENTIFY(HYSTORICAL_CHARACTERS:NORMAL)
    {
        @FULL_NAME[TYPE(NPH)]
    }

    IDENTIFY(HYSTORICAL_CHARACTERS:VERYHIGH)
    {
        @FULL_NAME[SYNCON(100001048)]//@SYN: #100001048# [Julius Caesar]
    }
}

When the above rules are applied to this text:

The Senate began bestowing honors on Julius Caesar while he was still campaigning in Hispania.

you will get these instances:

Instance text    Confidence score
Julius Caesar    0.89
he               0.50

and this field:

Field        Value            Confidence score
FULL_NAME    Julius Caesar    0.69

The instance he gets a score of 0.50 because it is extracted only by the first rule, whose NORMAL option corresponds to 50% of the default score (1.00).
The instance Julius Caesar, instead, is matched by both rules, because it is a proper noun and also corresponds to syncon 100001048. In this case the confidence score is computed with this formula:

highestScore + (difference * lowerScore1) [+ (difference * lowerScore2) + ...(difference * lowerScoreN)]

where:

  • highestScore is the highest score assigned by any of the matching rules.
  • lowerScore# is the score assigned by each of the other matching rules.
  • difference is the difference between the default confidence score (1.00) and highestScore.

In the example, the formula becomes:

0.80 + (0.20 * 0.50)

Here highestScore is 0.80, the custom VERYHIGH score option of the second rule; lowerScore is 0.50, the NORMAL score option of the first rule; and difference is 0.20, the default score minus highestScore (1.00 - 0.80 = 0.20).
The engine reports a score of 0.89 for this instance; note that the expression above, evaluated exactly, equals 0.90.

The field score, reported as 0.69, is the average of the two instance scores, 0.50 and 0.89 (exactly 0.695).
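The formula for an instance matched by several rules can be sketched in Python as follows. The function name combined_score and the constant DEFAULT_SCORE are hypothetical; the code implements the formula literally as written in this section and makes no claim about the engine's internal precision or rounding. Evaluated this way, the example inputs give 0.90 for the instance, while the engine output shown in this section is 0.89.

```python
from statistics import mean

DEFAULT_SCORE = 1.00


def combined_score(rule_scores):
    """Combine the scores that several rules assign to one instance.

    Literal reading of:
    highestScore + (difference * lowerScore1) + ... + (difference * lowerScoreN)
    No clamping is applied, because the text does not describe any.
    """
    ordered = sorted(rule_scores, reverse=True)
    highest, lower = ordered[0], ordered[1:]
    difference = DEFAULT_SCORE - highest
    return highest + sum(difference * score for score in lower)


# "Julius Caesar": matched by the NORMAL rule (0.50) and the VERYHIGH rule (0.80).
print(f"{combined_score([0.50, 0.80]):.2f}")  # 0.90

# "he": matched by the NORMAL rule only, so its score is unchanged.
print(f"{combined_score([0.50]):.2f}")  # 0.50

# Field score as the average of the instance scores reported in this section.
print(f"{mean([0.50, 0.89]):.3f}")  # 0.695
```

With a single matching rule there are no lower scores, so the formula reduces to that rule's score.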