The scoring mechanism
The engine first determines the confidence score of every instance, then computes the confidence score of each field as the average of its instances' scores.
For example, with this rule:
SCOPE SENTENCE
{
    IDENTIFY(PERSONAL_DATA)
    {
        @FULL_NAME[TYPE(NPH)]
    }
}
applied to this text:
John Smith is the CEO of Acme Ltd, founded in 1985. To this day, John lives in New York. He has 3 sons.
you will get these instances:
Instance text | Confidence score |
---|---|
John Smith | 1.00 |
John | 1.00 |
He | 1.00 |
and this field:
Field | Value | Confidence score |
---|---|---|
FULL_NAME | John Smith | 1.00 |
In this other case, with this rule:
SCOPE SENTENCE
{
    IDENTIFY(PERSONAL_DATA:LOW)
    {
        @FULL_NAME[TYPE(NPH)]
    }
}
applied to the same text above, you will get these instances:
Instance text | Confidence score |
---|---|
John Smith | 0.25 |
John | 0.25 |
He | 0.25 |
and this field:
Field | Value | Confidence score |
---|---|---|
FULL_NAME | John Smith | 0.25 |
All scores are 0.25 because the LOW score option corresponds to 25% of the default score (1.0).
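The two cases above can be reproduced with a minimal Python sketch. It is only an illustration of the arithmetic, not the engine's implementation: the names `DEFAULT_SCORE`, `SCORE_OPTIONS`, `instance_score` and `field_score` are hypothetical, and the option table lists only the percentages stated in this section (LOW and NORMAL).

```python
# Illustrative sketch only; every name here is hypothetical, not engine API.
DEFAULT_SCORE = 1.0

# Percentages of the default score, limited to the options named in this section.
SCORE_OPTIONS = {"LOW": 25, "NORMAL": 50}


def instance_score(option=None):
    """Score of an instance matched by a single rule with the given score option."""
    if option is None:
        # No score option on the rule: the default score applies.
        return DEFAULT_SCORE
    return DEFAULT_SCORE * SCORE_OPTIONS[option] / 100


def field_score(instance_scores):
    """Field score as the average of the scores of its instances."""
    return sum(instance_scores) / len(instance_scores)


instances = ["John Smith", "John", "He"]

# First rule: IDENTIFY(PERSONAL_DATA), no score option.
default_scores = [instance_score() for _ in instances]
# Second rule: IDENTIFY(PERSONAL_DATA:LOW).
low_scores = [instance_score("LOW") for _ in instances]

print(field_score(default_scores))  # 1.0
print(field_score(low_scores))      # 0.25
```

Because every instance of a field gets the same score in these two cases, the field average equals the instance score.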
In more complex cases, an extraction can be determined by multiple rules with different score options, for example:
CONFIDENCE
{
    @VERYHIGH:80
}
SCOPE SENTENCE
{
    IDENTIFY(HYSTORICAL_CHARACTERS:NORMAL)
    {
        @FULL_NAME[TYPE(NPH)]
    }
    IDENTIFY(HYSTORICAL_CHARACTERS:VERYHIGH)
    {
        @FULL_NAME[SYNCON(100001048)] //@SYN: #100001048# [Julius Caesar]
    }
}
When the above rules are applied to this text:
The Senate began bestowing honors on Julius Caesar while he was still campaigning in Hispania.
you will get these instances:
Instance text | Confidence score |
---|---|
Julius Caesar | 0.89 |
he | 0.50 |
and this field:
Field | Value | Confidence score |
---|---|---|
FULL_NAME | Julius Caesar | 0.69 |
The instance he gets a score of 0.50 because it is extracted only by the first rule, whose NORMAL option corresponds to 50% of the default score (1.0).
The instance Julius Caesar, instead, is matched by both rules because it is a proper noun and corresponds to syncon 100001048. In this case the confidence score is computed with this formula:
highestScore + (difference * lowerScore1) [+ (difference * lowerScore2) + ...(difference * lowerScoreN)]
where:
- highestScore is the highest score assigned by a rule.
- lowerScore# is the score of each of the other rules.
- difference is the difference between the default confidence score (1.00) and the highest score above.
In the example case, such a formula would be:
0.80 + (0.20 * 0.50)
because highestScore corresponds to the custom VERYHIGH score option (0.80), lowerScore is 0.50, which corresponds to the NORMAL score option of the first rule, and difference is 0.20, that is, the difference between the default score and highestScore (1.00 - 0.80 = 0.20).
Arithmetically the formula yields 0.90; the engine reports the instance score as 0.89.
The field score is 0.69, which is the average of the two instance scores, 0.50 and 0.89.
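The combination formula can be sketched in Python as follows. This is a hedged illustration under stated assumptions, not the engine's code: `combined_score` and `field_score` are hypothetical names, rule scores are passed in as plain fractions of the default score, and the sketch uses ordinary floating-point arithmetic without modelling any rounding the engine applies, so its results (0.90 for the instance) can differ in the last digit from the reported 0.89.

```python
# Illustrative sketch only; function names are hypothetical, not engine API.
DEFAULT_SCORE = 1.0


def combined_score(rule_scores):
    """Combine the scores of all the rules that matched the same instance.

    Implements: highestScore + (difference * lowerScore1) + ... + (difference * lowerScoreN)
    """
    ordered = sorted(rule_scores, reverse=True)
    highest, lower = ordered[0], ordered[1:]
    difference = DEFAULT_SCORE - highest
    return highest + sum(difference * score for score in lower)


def field_score(instance_scores):
    """Field score as the average of the scores of its instances."""
    return sum(instance_scores) / len(instance_scores)


# "Julius Caesar" is matched by the NORMAL rule (0.50) and the custom VERYHIGH rule (0.80).
julius_caesar = combined_score([0.50, 0.80])
# "he" is matched by the NORMAL rule only, so there is nothing to combine.
he = combined_score([0.50])

print(round(julius_caesar, 2))                     # 0.9 (plain arithmetic, no engine rounding)
print(round(he, 2))                                # 0.5
print(round(field_score([julius_caesar, he]), 2))  # 0.7 (plain arithmetic, no engine rounding)
```

With a single matching rule the list of lower scores is empty, so the formula reduces to the score of that rule, which matches the 0.50 assigned to he.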