onCategorizer introduction

Manipulating categorization results via script

Scripting can be used to post-process the categorization output.

Manipulating categorization output lets you refine results in ways that rules alone cannot achieve. For example, you can omit the child domains of domains already present in the results, transfer the score from one domain to another or, more simply, return only domains with significant scores.

Manipulation must take place in the onCategorizer event handler function, which is executed when the Categorizer event is fired, that is, after the categorization rules have been evaluated.

function onCategorizer() {

// Put here statements that change categorization results

}

Note

The following names can't be used as variable names, otherwise an error will occur:

  • FATHER
  • SON
  • ANCESTOR
  • DESCENDANT
  • SIBLING
  • RELATIVE
  • FOUNDER

Sample taxonomy

In this section of the book, dedicated to the manipulation of categorization results, we will use this taxonomy as a reference for most of the examples. Each line shows the domain ID followed by the domain label:

1   Sport
    1.01    martial art
        1.01.1  aikido
        1.01.2  judo
        1.01.3  karate
    1.02    athletics
    1.03    baseball
    1.04    football
    1.05    gymnastics
        1.05.1  artistic gymnastics
        1.05.2  rhythmic gymnastics
    1.06    american football
    1.07    golf
    1.08    hockey
        1.08.1  ice hockey
        1.08.2  field hockey
        1.08.3  roller hockey
    1.09    horse racing
    1.10    swimming
    1.11    volleyball
    1.12    basketball
    1.13    handball
    1.14    water polo
    1.15    rugby
    1.16    fencing
    1.17    skiing
    1.18    tennis

The hidden results table

You can think of the initial categorization results—the outcome of the activation of categorization rules—as a table like the one shown below.

Domain ID   Domain label   Score   Compound score   Frequency
1           Sport          90      90               52.94%
1.01        martial art    60      60               35.29%
1.07        golf           10      10               5.88%
1.15        rugby          10      10               5.88%

Score is the sum of the points that the rules activated by the text have assigned to the domain. Unless the CHILD_TO_FATHER option is set, Compound score is a copy of Score. Both scores are influenced by the CHILD_TO_FATHER option.

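Frequency is a function of the score. It is computed as the ratio between the score of the domain and the sum of the scores of all the domains in the table:

\[ \text{frequency}_d = \frac{\text{score}_d}{\sum_{i} \text{score}_i} \]

and expressed as a percentage. So, for example, the frequency of category 1 is:

\[ \frac{90}{90 + 60 + 10 + 10} = \frac{90}{170} \approx 52.94\% \]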
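The values in the sample table are consistent with dividing each domain score by the sum of all the scores in the table (for category 1, 90 / 170 ≈ 52.94%). The following is a minimal sketch of that arithmetic in plain JavaScript. The `results` array and the `withFrequency` helper are hypothetical names used only for illustration; they are not part of the engine's scripting API.

```javascript
// Hypothetical in-memory copy of the sample hidden results table.
const results = [
  { id: "1",    label: "Sport",       score: 90 },
  { id: "1.01", label: "martial art", score: 60 },
  { id: "1.07", label: "golf",        score: 10 },
  { id: "1.15", label: "rugby",       score: 10 },
];

// Assumed formula: frequency = domain score / sum of all scores, as a percentage.
function withFrequency(rows) {
  const total = rows.reduce((sum, row) => sum + row.score, 0);
  return rows.map((row) => ({
    ...row,
    frequency: total === 0 ? 0 : (row.score / total) * 100,
  }));
}

const table = withFrequency(results);
for (const row of table) {
  console.log(`${row.id} ${row.label}: ${row.frequency.toFixed(2)}%`);
}
```

Rounded to two decimals, the computed values match the Frequency column of the sample table.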

Final output and the influence of onCategorizer

The final output of the text intelligence engine is a list of "winner" results.
If the onCategorizer function is not defined, all the results in the hidden table will be considered "winners" and thus returned as output.

If instead the function is defined, the engine will look at the predefined set named WINNERS to determine "winning" results. The set is initially empty: if the function's code doesn't populate it, there will be no winners and the engine will not return any results.

In other words, if onCategorizer is defined, results present in the hidden table will all be considered "losers", unless they are referenced in the WINNERS set.
"Losing" results are not returned in output, but are still visible, with appropriate settings, in the Studio development environment.

Flow

If you want to manipulate categorization results, define the onCategorizer function and put in it scripting code that:

  1. Copies the contents of the predefined set ALL into a user-defined set.
  2. Manipulates the user-defined set, possibly using other user-defined sets, with specific functions (filters, intersections, deletions, score transformations, etc.).
  3. Copies the results of the manipulation into the WINNERS predefined set.
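The three steps can be sketched as follows, using the sample results and one of the manipulations mentioned earlier: omitting the child domains of domains already present in the results. The sketch uses plain JavaScript arrays as stand-ins for the predefined sets, so that it runs on its own. The array-based ALL and WINNERS, the `hasAncestorIn` helper and the reliance on dotted domain IDs are all assumptions for illustration; the actual set functions offered by the scripting environment are not shown here.

```javascript
// Stand-ins for the predefined sets; in the engine they are provided for you.
const ALL = [
  { id: "1",    label: "Sport",       score: 90 },
  { id: "1.01", label: "martial art", score: 60 },
  { id: "1.07", label: "golf",        score: 10 },
  { id: "1.15", label: "rugby",       score: 10 },
];
let WINNERS = [];

// Hypothetical helper: true when some other row's ID is a dotted prefix
// of this row's ID, that is, when an ancestor domain is also in the rows.
function hasAncestorIn(row, rows) {
  return rows.some((other) => row.id.startsWith(other.id + "."));
}

function onCategorizer() {
  // 1. Copy the contents of ALL into a user-defined set.
  const working = ALL.slice();

  // 2. Manipulate the user-defined set: drop every domain whose
  //    ancestor is already present in the results.
  const topMost = working.filter((row) => !hasAncestorIn(row, working));

  // 3. Copy the result of the manipulation into WINNERS.
  WINNERS = topMost;
}

onCategorizer();
console.log(WINNERS.map((row) => row.id));
```

With the sample results, every other domain descends from domain 1, so only Sport remains among the winners.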