Training and test libraries

In machine learning, training and test sets are used to build and evaluate mathematical models that make predictions on input data.
In Platform, models are "text intelligence engines" capable of carrying out document classification and information extraction.
The training set is the set of examples that allows the machine learning engine to learn to make predictions. In Platform, the training set is called the training library, and it contains documents annotated with the expected classification and extraction results.
The equivalent of the test set is the test library, also annotated with expected results and used to test the ML model.

In short, you use a training library to create the ML model and a test library to verify what it has learned. Annotations in the training library are what the model learns from, while annotations in the test library are the reference against which prediction quality is measured.

One way to produce training sets and test sets is to start from a single larger data set and partition it, typically putting 75% of the documents in the training set and the remaining 25% in the test set.
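
As an illustration of the 75/25 partition described above, here is a minimal Python sketch. It works on a plain list of file names and is not a Platform API: the `split_documents` function, the `doc_NNN.txt` names and the fixed random seed are all assumptions made for the example.

```python
import random


def split_documents(documents, train_pct=75, seed=42):
    """Shuffle a copy of the documents and split it into training and test lists.

    Hypothetical helper for illustration only, not part of Platform.
    """
    shuffled = list(documents)
    # A fixed seed makes the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_pct // 100
    return shuffled[:cut], shuffled[cut:]


# Made-up document names standing in for a real data set.
documents = [f"doc_{i:03d}.txt" for i in range(100)]
training_library, test_library = split_documents(documents)
print(len(training_library), len(test_library))  # 75 25
```

Shuffling before cutting avoids a split that simply mirrors the original order of the data set, for example all documents of one category ending up in the test library.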

Platform allows you to create other generic libraries at will, for example to use them as validation sets. In that case, consider a different partitioning of the initial data set, for example 70% for the training library, 15% for a generic library used as a validation set and the remaining 15% for the test library.
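
The 70/15/15 partition can be sketched in the same way. Again, this is only an illustration on a list of file names: `split_three_ways` and the document names are hypothetical and do not correspond to any Platform function.

```python
import random


def split_three_ways(documents, train_pct=70, validation_pct=15, seed=42):
    """Split documents into training, validation and test lists.

    Hypothetical helper for illustration only; whatever is left after the
    training and validation shares goes to the test list.
    """
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)
    train_end = len(shuffled) * train_pct // 100
    validation_end = len(shuffled) * (train_pct + validation_pct) // 100
    return (
        shuffled[:train_end],
        shuffled[train_end:validation_end],
        shuffled[validation_end:],
    )


# Made-up document names standing in for a real data set.
documents = [f"doc_{i:03d}.txt" for i in range(200)]
training, validation, test = split_three_ways(documents)
print(len(training), len(validation), len(test))  # 140 30 30
```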

When you run an experiment, the test library is the object of the experiment: classification or extraction is carried out on its documents, and the results are shown, for example, in terms of precision and recall.
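
To make precision and recall concrete, the sketch below compares expected annotations with predicted categories for a few documents. It assumes a micro-averaged calculation over (document, category) pairs, which may differ from how Platform computes its own metrics; the `precision_recall` function, the document names and the categories are all invented for the example.

```python
def precision_recall(expected, predicted):
    """Micro-averaged precision and recall over (document, category) pairs.

    Both arguments map a document name to a set of category labels.
    Hypothetical helper for illustration only, not Platform's own metric code.
    """
    true_pos = false_pos = false_neg = 0
    for doc in expected.keys() | predicted.keys():
        exp = expected.get(doc, set())
        pred = predicted.get(doc, set())
        true_pos += len(exp & pred)   # predicted and expected
        false_pos += len(pred - exp)  # predicted but not expected
        false_neg += len(exp - pred)  # expected but not predicted
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Invented annotations (expected) and model output (predicted).
expected = {"doc_1": {"sports"}, "doc_2": {"finance", "politics"}, "doc_3": {"health"}}
predicted = {"doc_1": {"sports"}, "doc_2": {"finance"}, "doc_3": {"sports"}}

precision, recall = precision_recall(expected, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.50
```

In this example two of the three predicted categories are correct (precision 2/3), while only two of the four expected categories were found (recall 2/4).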