Categorization peculiarities
This section describes the peculiarities of the rules language regarding the categorization task.
As explained in the introduction, categorization consists of determining what a document is about by choosing among the possible domains (the categories) listed in the taxonomy. Every categorization project includes a taxonomy, which contains all the domains of that project.
Note
Taxonomies do not apply to extraction tasks.
To draw a comparison with the "spotter cards" mentioned in the introduction, the taxonomy must contain "the names of all the planes" that could potentially be identified.
For example, here is a possible taxonomy of a project in which the engine is required to categorize news about a professional basketball association such as the NBA:
CONFERENCE
    EasternConference
    WesternConference
SEASON
    Regular
    Playoffs
    Finals
FOUL
    PersonalFoul
    FlagrantFoul
    TechnicalFoul
Money
OTHERNEWS
    Awards
        MVP
        DefensivePlayer
        RookiePlayer
        TopScorer
    Retirement
    CoachingChanges
    NBADraft
The taxonomy can be considered a hierarchical tree structure; in fact, it is also called the "domain tree". It usually reflects all or part of a knowledge domain, hence the usefulness of the hierarchical structure. However, it is also possible to define a flat taxonomy, that is, a list of independent elements with no parent-child relationships.
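As a rough illustration of the difference between a hierarchical and a flat taxonomy, the sketch below models part of the basketball example in Python. It does not show how the development tool stores a taxonomy: the nested-dictionary layout, the parent/child nesting, and the helper name `flatten` are all assumptions made for illustration only.

```python
# Hypothetical in-memory model of a domain tree: each key is a domain name
# and each value is a dict of its child domains (an empty dict is a leaf).
# The nesting is assumed from the example taxonomy, not read from the tool.
domain_tree = {
    "CONFERENCE": {
        "EasternConference": {},
        "WesternConference": {},
    },
    "SEASON": {
        "Regular": {},
        "Playoffs": {},
        "Finals": {},
    },
}


def flatten(tree):
    """Return every domain name in the tree, parents before children."""
    names = []
    for name, children in tree.items():
        names.append(name)
        names.extend(flatten(children))
    return names


# A flat taxonomy is simply a list of domains with no parent/child links.
flat_taxonomy = flatten(domain_tree)
print(flat_taxonomy)
```

The same domain names appear in both forms; only the hierarchical form records which domain is a sub-domain of which.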
In a project, the taxonomy is not defined within the rules language source code; rather, it is an external data structure defined using the graphical development tool.
Each domain has a unique name: no two domains within the same taxonomy can share a name. Each categorization rule refers to a domain by its name; domains can also have an optional description.
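The uniqueness constraint can be pictured with a small Python check. This is only a sketch: the `(name, description)` tuple layout, the sample description text, and the function name `duplicate_names` are hypothetical and are not part of the rules language or the development tool.

```python
from collections import Counter

# Hypothetical domain records: (name, optional description).
# The description text below is invented for illustration.
domains = [
    ("PersonalFoul", "Example description of the domain"),
    ("FlagrantFoul", None),   # the description is optional
    ("TechnicalFoul", None),
]


def duplicate_names(records):
    """Return the sorted list of domain names that occur more than once."""
    counts = Counter(name for name, _description in records)
    return sorted(name for name, n in counts.items() if n > 1)


# A valid taxonomy has no duplicated names.
print(duplicate_names(domains))  # prints []
```

Adding a second record named `FlagrantFoul` to the list would make the function return `['FlagrantFoul']`, which is the situation a taxonomy does not allow.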
If triggered, a categorization rule assigns a certain number of points (a score) to the domain with which it is associated. The domains that receive the most points are considered the "winners". The output produced when a categorization engine processes an input document consists of one or more domains and their corresponding scores.
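The idea of rules contributing points to domains can be sketched in Python as follows. This is a deliberately simplified picture, not the engine's actual scoring mechanism, which is covered in the following topics: the point values, the plain summing of points, the `top_n` cut-off, and the names `score_domains` and `winners` are all assumptions.

```python
from collections import defaultdict

# Hypothetical record of triggered rules: (domain name, points assigned).
triggered_rules = [
    ("Playoffs", 10),
    ("FlagrantFoul", 5),
    ("Playoffs", 10),
    ("WesternConference", 5),
]


def score_domains(hits):
    """Sum the points each domain received from its triggered rules."""
    scores = defaultdict(int)
    for domain, points in hits:
        scores[domain] += points
    return dict(scores)


def winners(scores, top_n=2):
    """Return the top_n (domain, score) pairs, highest score first.

    Ties are broken alphabetically by domain name (an arbitrary choice).
    """
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


scores = score_domains(triggered_rules)
print(winners(scores))  # prints [('Playoffs', 20), ('FlagrantFoul', 5)]
```

In this sketch the output pairs each winning domain with its score, mirroring the shape of the result described above.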
The following topics cover the relationship between rules and domains as well as the domain scoring mechanism.