Categorization peculiarities

This section describes the peculiarities of the rules language regarding the categorization task.

As noted in the introduction, categorization consists of determining what a document is about by choosing the applicable domains (the categories) from those listed in the taxonomy. Every categorization project includes a taxonomy, which contains all of the domains for that project.

Note

Taxonomies do not apply to extraction tasks.

To extend the comparison with the "spotter cards" mentioned in the introduction, the taxonomy must contain "the names of all the planes" that could potentially be identified.

For example, here is a possible taxonomy of a project in which the engine is required to categorize news about a professional basketball association such as the NBA:

CONFERENCE
    EasternConference
    WesternConference
SEASON
    Regular
    Playoffs
    Finals
FOUL
    PersonalFoul
    FlagrantFoul
    TechnicalFoul
Money
OTHERNEWS
    Awards
        MVP
        DefensivePlayer
        RookiePlayer
        TopScorer
    Retirement
    CoachingChanges
    NBADraft

The taxonomy can be viewed as a hierarchical tree structure, which is why it is also called a "domain tree". It usually reflects all or part of a knowledge domain, hence the usefulness of the hierarchy. It is also possible, however, to define a flat taxonomy, that is, a list of independent elements.
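To make the tree structure concrete, the example taxonomy above can be sketched as nested Python dictionaries. This is only an illustration: the names `TAXONOMY`, `iter_domains` and `is_flat` are hypothetical, and the representation is an assumption rather than the format the product actually uses to store a taxonomy.

```python
# Hypothetical in-memory model of the example taxonomy (not the product's format).
# Each key is a domain name; each value holds that domain's child domains.
TAXONOMY = {
    "CONFERENCE": {"EasternConference": {}, "WesternConference": {}},
    "SEASON": {"Regular": {}, "Playoffs": {}, "Finals": {}},
    "FOUL": {"PersonalFoul": {}, "FlagrantFoul": {}, "TechnicalFoul": {}},
    "Money": {},
    "OTHERNEWS": {
        "Awards": {
            "MVP": {},
            "DefensivePlayer": {},
            "RookiePlayer": {},
            "TopScorer": {},
        },
        "Retirement": {},
        "CoachingChanges": {},
        "NBADraft": {},
    },
}


def iter_domains(tree, path=()):
    """Yield the full path of every domain, parents before their children."""
    for name, children in tree.items():
        current = path + (name,)
        yield current
        yield from iter_domains(children, current)


def is_flat(tree):
    """Return True when no top-level domain has children (a flat taxonomy)."""
    return all(not children for children in tree.values())


paths = list(iter_domains(TAXONOMY))
for domain_path in paths:
    # Indent each domain by its depth to reproduce the tree layout.
    print("    " * (len(domain_path) - 1) + domain_path[-1])
```

A flat taxonomy would simply be a dictionary whose values are all empty, for which `is_flat` returns `True`.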

In a project, the taxonomy is not defined within the rules language source code. Instead, it is an external data structure defined with the graphical development tool.

Each domain has a unique name: no two domains in the same taxonomy can share a name. Each categorization rule refers to a domain by its name. Domains can also have an optional description.
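The uniqueness constraint can be illustrated with a short check. This is a minimal sketch under stated assumptions: the `Domain` class and the `duplicate_names` helper are hypothetical, and names are compared as exact strings, which may not match how the development tool compares them.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Domain:
    """Hypothetical record for a taxonomy domain."""
    name: str                          # must be unique within the taxonomy
    description: Optional[str] = None  # optional, as stated above


def duplicate_names(domains):
    """Return the sorted names used by more than one domain.

    Assumption: names are compared as exact, case-sensitive strings.
    """
    counts = Counter(domain.name for domain in domains)
    return sorted(name for name, count in counts.items() if count > 1)


valid = [Domain("MVP", "Most valuable player award"), Domain("Retirement")]
invalid = valid + [Domain("MVP")]

print(duplicate_names(valid))    # no duplicates expected
print(duplicate_names(invalid))  # "MVP" appears twice
```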

If triggered, a categorization rule attributes a certain number of points (a score) to the domain with which it is associated. The domains that receive the most points are considered the "winners". The output produced when a categorization engine processes an input document consists of one or more domains and their corresponding scores.

The following topics cover the relationship between rules and domains as well as the domain scoring mechanism.
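As a rough intuition for the scoring idea described above, the sketch below adds up the points of triggered rules per domain and ranks the domains. It is not the engine's actual algorithm: the function `score_domains`, the point values, and the assumption that points are simply summed are all hypothetical, and the real mechanism is described in the topics that follow.

```python
from collections import defaultdict


def score_domains(triggered_rules):
    """Rank domains by total points.

    triggered_rules: iterable of (domain_name, points) pairs, one per
    triggered rule. Summing the points is an assumption made for
    illustration only.
    """
    scores = defaultdict(int)
    for domain, points in triggered_rules:
        scores[domain] += points
    # Highest score first; ties are broken alphabetically by domain name.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical rule hits for one document; the point values are invented.
hits = [("Playoffs", 10), ("MVP", 15), ("Playoffs", 10), ("Money", 5)]
ranking = score_domains(hits)

for domain, score in ranking:
    print(f"{domain}: {score}")
```

With these invented hits, "Playoffs" accumulates the most points and would be the first "winner" in the output.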