Experiments engine setup parameters

The parameters used when starting a categorization experiment or an extraction experiment are listed in the following tables.

Advanced parameters are included and are marked with a blue caption in italics. To hide them, select Hide advanced parameters.

Categorization

Auto-ML categorization

Training docs

In Training library, select the annotated library to use for the experiment.
Select the Training documents selection policy among:
- Only validated annotated documents (strict)
- Only validated or annotated documents (strict) (Selected by default)
- Prefer validated documents
- Prefer annotated documents
- Random selection

Model type

Parameter	Description
Linear SVM	Linear SVM classifier model: standard Support Vector Machine using linear regression margins
Probabilistic SVM	Probabilistic SVM classifier model: Support Vector Machine using probability distribution prediction scores
Custom kernel SVM	Custom kernel SVM classifier model: Support Vector Machine using custom kernel
SGD	SGD classifier model: Stochastic Gradient Descent learning mechanism on a Linear SVM model
GBoost	GBoost classifier model: Gradient boosting technique of stacking decision tree models, sequentially training on residual errors
XGBoost	XGBoost classifier model: Extreme gradient boosting technique using more accurate approximations over a GBoost model
Random Forest	Random Forest classifier model: ensemble of decision trees using combined majority predictions
Logistic Regression	Logistic Regression classifier model: logistic function used to model probabilities of possible outcomes
Multinomial Naive Bayes	Multinomial Naïve Bayes classifier model: standard Naïve Bayes model using conditional probability of words to determine predictions
Complement Naive Bayes	Complement Naïve Bayes classifier model: multinomial Naïve Bayes model improved by using statistics from the complement of each class to compute model weights

Warning

If you select one of the following models:
- Probabilistic SVM
- GBoost
- XGBoost
- Random Forest
- Logistic Regression
- Multinomial Naive Bayes
- Complement Naive Bayes
the Auto ML parameters and the F-Beta parameters won't be available.
If you select the Custom kernel SVM model, the Auto ML parameters won't be available.
If you select more than one model type, the Feature space parameters, Hyperparameters, F-Beta parameters and Auto ML parameters won't be available.
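
The availability rules in this warning can be summarized as a small lookup. The following is only an illustrative sketch: the function name and the group labels are hypothetical, and the interaction with the strict "single label" mode is not modeled.

```python
# Model types that disable both the Auto ML and the F-Beta parameters,
# as listed in the warning above.
NO_AUTOML_NO_FBETA = {
    "Probabilistic SVM", "GBoost", "XGBoost", "Random Forest",
    "Logistic Regression", "Multinomial Naive Bayes", "Complement Naive Bayes",
}

ALL_GROUPS = {"Feature space", "Hyperparameters", "F-Beta", "Auto ML"}


def available_parameter_groups(selected_models):
    """Return the parameter groups that remain available (hypothetical helper)."""
    selected = set(selected_models)
    if len(selected) > 1:
        # More than one model type: none of the four groups is available.
        return set()
    groups = set(ALL_GROUPS)
    if selected & NO_AUTOML_NO_FBETA:
        groups -= {"Auto ML", "F-Beta"}
    if "Custom kernel SVM" in selected:
        groups -= {"Auto ML"}
    return groups


print(sorted(available_parameter_groups(["GBoost"])))
```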

Problem definition

Parameter	Description
Enable strict "single label" mode	Enable strict single label mode (off by default)
Enable strict "Sub document categorization" compatibility mode	Activating this mode enables the trained model to use annotated strings of text to predict classes ("sub document categorization"). Switched off by default. Currently, annotation and testing must be done outside of the Platform.

Warning

If you switch on the Enable strict "single label" mode parameter, the F-Beta parameters will not be available.

Feature space

Switch Automatic features selection on or off to enable or disable the automatic selection of the best parameter combination by Platform.

If Automatic features selection is switched off, select the status, Use or Don't use, of the following parameters:

Parameter	Description
Word form	Occurrence of a keyword. Set on Use by default.
Word base form (Lemma)	Base form of a word (lemma) (e.g. “run” for “running” or “ran”). Set on Use by default
Main lemma	Document-level most representative lemmas. Set on Don't use by default
Word base form stem	Stem of a word (e.g. “intern” is the stem of “international”). Set on Use by default
Sub-words	Unit smaller than a word (e.g. morphemes, stems and endings, roots, etc.). Set on Use by default
Entities	Entities (e.g. persons, organizations, etc.). Set on Use by default
Syncons	Conceptual meaning of a word or phrase (e.g. “to work out” means “to exercise”). Set on Use by default
Main Syncons	Document-level most representative syncons. Set on Use by default
Syncon Topics	Generalized main subjects being discussed (e.g. “mammal” as a concept in “the tiger is a mammal” has topic “zoology”). Set on Use by default
Main Topics	Document-level most representative topics. Set on Use by default
Knowledge Label	Pre-defined parent syncon (e.g. “legal action” is the knowledge label for “moratorium”). Set on Don't use by default
Knowledge Graph relations	Attribute the hierarchical relation nodes as added meaning to the word (e.g. “dentist” is also “medical specialist”, “doctor”, “professional”, etc.). Set on Use by default
Use word embeddings	Set on Don't use by default

Hyperparameters

Select Activate Auto-ML on every parameter to enable automatic parameter configuration. Deselected by default.

If Activate Auto-ML on every parameter is off:

Select Enable strict "Sub document categorization" compatibility mode to employ the trained model in "Sub document categorization" mode (by exporting it). Off by default.
Select Enable strict "single label" mode to activate the single label mode. Off by default.

Parameter	Description
SVM C parameter: penalty for misclassifications	One of: 0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1 (by default), 5
Class weight	Balanced (by default) or None
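
This page does not define how the Balanced class weight is computed. As a rough illustration only, the sketch below assumes the common convention (used, for example, by scikit-learn's "balanced" option) in which each class is weighted by n_samples / (n_classes * class_count), so that rare classes count more. The function name and the sample labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency (assumed convention)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Hypothetical training labels: 8 documents in one class, 2 in the other.
weights = balanced_class_weights(["sport"] * 8 + ["finance"] * 2)
print(weights)
```

With None, every class would simply keep a weight of 1.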

F-Beta

Parameter	Description
Enable F Beta optimization (tuning balance between precision and recall)	Disabled by default. Enable it to set the other parameters.
Target F Beta	The default beta value is 1.0, which is the same as the F-measure. A smaller beta value, such as 0.5, gives more weight to precision and less to recall. A larger beta value, such as 2.0, gives less weight to precision and more weight to recall in the calculation of the score. Number between 0 and 2 (1 by default)
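
To make the effect of the target beta concrete, the standard F-beta formula can be sketched as follows. The function name and the precision and recall figures are illustrative and not taken from the Platform.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score: weighted harmonic mean of precision and recall."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0  # avoid dividing by zero when there is nothing to score
    return (1 + b2) * precision * recall / denominator


# Hypothetical model with high precision and lower recall.
for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_beta(0.9, 0.6, beta), 3))
```

For a model whose precision is higher than its recall, the score rises as beta decreases and falls as beta increases.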

Auto ML parameters

Machine Learning automatic self-tuning process parameters

AutoML is an algorithm that runs separately from the model being trained. AutoML selects the best combination of Features and Model Parameters to return the best performing model. The parameters on this screen pertain only to the AutoML algorithm.

Parameter	Description
Number of training iterations for the AutoML algorithm	Number between 20 and 100 (30 by default)
Number of data splits for cross-validation of AutoML algorithm	Number between 2 and 10 (3 by default)
Call back function for stopping the AutoML self-tuning process	One of: Stop based on best scoring evaluation; Stop based on total time; Stop based on both best scoring evaluation and total time (by default)
Target time deadline for the AutoML call back stop function (minutes)	Number between 15 and 120 (30 by default)

Note

If you select Stop based on best scoring evaluation from the Call back function for stopping the AutoML self-tuning process parameter, the Target time deadline for the AutoML call back stop function (minutes) parameter won't be available.
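
The three stop options can be pictured with the sketch below. It is not the Platform's implementation: the mode names, the patience value and the idea that "best scoring evaluation" means "no improvement for a number of iterations" are all assumptions made for illustration.

```python
def should_stop(mode, elapsed_minutes, deadline_minutes=30,
                stale_iterations=0, patience=5):
    """Decide whether a self-tuning loop should stop (hypothetical helper).

    stale_iterations: iterations since the best score last improved.
    patience: assumed number of stale iterations tolerated before stopping.
    """
    score_stop = stale_iterations >= patience
    time_stop = elapsed_minutes >= deadline_minutes
    if mode == "score":
        return score_stop  # the time deadline plays no role in this mode
    if mode == "time":
        return time_stop
    if mode == "both":
        return score_stop or time_stop
    raise ValueError(f"unknown mode: {mode}")


print(should_stop("both", elapsed_minutes=12, stale_iterations=5))
```

In the score-only mode the deadline is never consulted, which is consistent with the time deadline parameter being unavailable for that option.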

Explainable categorization

Training docs

In Training library, select the annotated library to use for the experiment.
Select the Training documents selection policy among:
- Only validated annotated documents (strict)
- Only validated or annotated documents (strict) (Selected by default)
- Prefer validated documents
- Prefer annotated documents
- Random selection

Generic parameters

Parameter	Description
Enable "onCategorizer" optimization	Enabled by default
Enable "strict" hierarchical mode	Disabled by default
Enable "single label" mode	Disabled by default

Note

If Enable "onCategorizer" optimization is disabled, the Fine tuning parameters are not available.

Rules Generation

Parameter	Description
Enable generation of syncon based rules	Enabled by default
Enable generation of ancestor based rules	Disabled by default
Max number of items in each rule	Number between 2 and 4, 3 by default
Max number of rules for each taxonomy category	Number between 5 and 1000, 200 by default
Min number of annotated documents for a category, to enable rules generation	Number between 2 and 1000, 5 by default
Max number of rules in which any single item can participate	Number between 2 and 200, 40 by default

Note

If Enable generation of syncon based rules is disabled, Enable generation of ancestor based rules will not be available.

Fine tuning

Parameter	Description
Desired Clean level	Automatic by default
Default clean level (initial value if auto is selected)	10 by default
Desired filter sequence	Automatic by default
Default filter sequence (initial value if auto is selected)	40, 80, 90, 90, 90 by default
Enable conservative clean	Disabled by default
Max number of documents to be considered by the optimization algorithm (value of -1 is no limit)	Number greater than -1 (set by default)

F-Beta

Parameter	Description
Target F Beta	A beta value of 1.0 is the same as the F-measure. A smaller beta value, such as 0.5, gives more weight to precision and less to recall. A larger beta value, such as 2.0, gives less weight to precision and more weight to recall in the calculation of the score. Number between 0 and 2, 0.75 by default

Bootstrapped Studio Project

Training docs

In Training library, select the annotated library to use for the experiment.
Select the Training documents selection policy among:
- Only validated annotated documents (strict)
- Only validated or annotated documents (strict) (Selected by default)
- Prefer validated documents
- Prefer annotated documents
- Random selection

Rules Generation

Parameter	Description
Enable generation of syncon based rules	Enabled by default
Enable generation of ancestor based rules	Disabled by default
Max number of items in each rule	Number between 1 and 3, 2 by default
Max number of rules for each taxonomy category	Number between 5 and 50, 20 by default
Min number of annotated documents for a category, to enable rules generation	Number between 2 and 100, 5 by default
Max number of rules in which any single item can participate	Number between 2 and 20, 5 by default
Max number of elements in a single item of a rule	Number between 1 and 3, 2 by default

Note

If Enable generation of syncon based rules is disabled, Enable generation of ancestor based rules will not be available.

Fine tuning

Parameter	Description
Desired Clean level	10 by default
Desired filter sequence	40, 80, 90 by default

Extraction

Auto-ML Extraction

Training docs

In Training library, select the annotated library to use for the experiment.
Select the Training documents selection policy among:
- Only validated annotated documents (strict)
- Only validated or annotated documents (strict) (Selected by default)
- Prefer validated documents
- Prefer annotated documents
- Random selection

Model type

Parameter	Description
CRF	CRF entity extraction model: Conditional Random Fields probabilistic model designed for sequence labeling
SVM sliding window	Support Vector Machine using a sequence tagging approach translated into a local linear SVC classifier

Note

If you select the SVM sliding window model, the Hyperparameters and the F-Beta parameters won't be available.
If you select more than one model type the Feature space parameters, Hyperparameters, F-Beta parameters and Auto ML parameters won't be available.

Feature space

Switch Automatic features selection on or off to enable or disable the automatic selection of the best parameter combination by Platform.

If Automatic features selection is switched off, select the status, Use or Don't use, of the following parameters:

Parameter	Description
Word base form (Lemma)	Base form of a word (lemma) (e.g. “run” for “running” or “ran”). Set on Don't use by default
Logic dependencies	Relationships and dependencies of the word (e.g. “subject” – “relationship type” – “object” relationships). Set on Use by default
Word Part-of-Speech	Part-of-speech of a word (e.g. noun, verb, etc.). Set on Use by default
Collocations	Combination of words frequently used together which have a specific meaning (e.g. “regular exercise” or “to take a risk”). Set on Use by default
Phrases	Combination of words that together create a singular meaning (e.g. “to look after” or “on the table”). Set on Don't use by default
Syncons	Conceptual meaning of a word or phrase (e.g. “to work out” means “to exercise”). Set on Use by default
Known Concepts	Specific meaning of a term that is available in the Expert.ai Knowledge Graph (e.g. “Italy” is a specific country, “World Cup” is a specific football tournament). Set on Don't use by default
Entities	Entities (e.g. persons, organizations, etc.). Set on Use by default
Knowledge Graph relations	Attribute the hierarchical relation nodes as added meaning to the word (e.g. “dentist” is also “medical specialist”, “doctor”, “professional”, etc.). Set on Don't use by default
Title case words	Toggle title case as a feature in the word vector. Set on Use by default
Upper case word	Toggle uppercase as a feature in the word vector. Set on Use by default
Digit words	Toggle digit as a feature in the word vector. Set on Use by default
Mixed case words	Toggle mixed case as a feature in the word vector. Set on Use by default
Alpha Numeric words	Toggle alpha numeric as a feature in the word vector. Set on Use by default
Alphabetic words	Toggle alphabetic as a feature in the word vector. Set on Use by default
Numeric words	Toggle numeric as a feature in the word vector. Set on Use by default
Decimal number words	Toggle decimal number as a feature in the word vector. Set on Use by default

Hyperparameters

Select Activate Auto-ML on every parameter to enable automatic parameter configuration. Deselected by default.

Parameter	Description
CRF c1 regularization coefficient	One of: 0, 0.00001, 0.0001, 0.001 (by default), 0.01, 0.05, 0.1, 0.3, 0.5, 0.8, 1, 2, 5, 10, 100
CRF c2 regularization coefficient	One of: 0.00001, 0.0001, 0.001 (by default), 0.01, 0.05, 0.1, 0.3, 0.5, 2, 5, 10, 100
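
This page does not say how the c1 and c2 coefficients enter the training objective. The sketch below assumes the convention used by CRFsuite-style trainers, where c1 scales an L1 penalty (sum of absolute weights) and c2 scales an L2 penalty (sum of squared weights). The function name and the sample weights are hypothetical.

```python
def regularization_penalty(weights, c1=0.001, c2=0.001):
    """Penalty added to the training loss (assumed CRFsuite-style convention)."""
    l1_term = sum(abs(w) for w in weights)   # pushes weights to exactly zero
    l2_term = sum(w * w for w in weights)    # shrinks large weights
    return c1 * l1_term + c2 * l2_term


# Hypothetical feature weights of a tiny model.
print(regularization_penalty([1.0, -2.0], c1=0.5, c2=2))
```

Under this assumption, larger values of either coefficient constrain the model more strongly, and a c1 of 0 removes the L1 term entirely.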

F-Beta

Parameter	Description
Enable F Beta optimization (tuning balance between precision and recall)	Disabled by default. Enable it to set the parameter below
Target F Beta	The default beta value is 1.0, which is the same as the F-measure. A smaller beta value, such as 0.5, gives more weight to precision and less to recall. A larger beta value, such as 2.0, gives less weight to precision and more weight to recall in the calculation of the score. Number between 0 and 5 (1 by default)

Auto-ML parameters

Machine Learning automatic self-tuning process parameters

AutoML is an algorithm that runs separately from the model being trained. AutoML selects the best combination of Features and Model Parameters to return the best performing model. The parameters on this screen pertain only to the AutoML algorithm.

Parameter	Description
Number of training iterations for the AutoML algorithm	Number between 20 and 100 (30 by default)
Number of data splits for cross-validation of AutoML algorithm	Number between 2 and 10 (3 by default)
Call back function for stopping the AutoML self-tuning process	One of: Stop based on best scoring evaluation; Stop based on total time; Stop based on both best scoring evaluation and total time (by default)
Target time deadline for the AutoML call back stop function (minutes)	Number between 15 and 120 (30 by default)

Note

If you select Stop based on best scoring evaluation, the Target time deadline for the AutoML call back stop function (minutes) parameter won't be available.
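
The number of data splits for cross-validation can be illustrated with a plain k-fold sketch: the training documents are divided into that many folds, and each fold is used once for validation while the others are used for training. How the Platform actually assigns documents to folds is not documented here, so the round-robin assignment and the function name below are assumptions.

```python
def k_fold_indices(n_documents, n_splits=3):
    """Yield (training, validation) index lists for each fold (illustrative)."""
    if not 2 <= n_splits <= 10:
        raise ValueError("n_splits must be between 2 and 10")
    # Round-robin assignment of document indices to folds.
    folds = [list(range(i, n_documents, n_splits)) for i in range(n_splits)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i
                    for idx in fold]
        yield training, validation


for training, validation in k_fold_indices(7, n_splits=3):
    print(training, validation)
```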

Explainable Extraction

Training docs

In Training library, select the annotated library to use for the experiment.
Select the Training documents selection policy among:
- Only validated annotated documents (strict)
- Only validated or annotated documents (strict) (Selected by default)
- Prefer validated documents
- Prefer annotated documents
- Random selection

Rules generation

Parameter	Description
Maximum number of conditions for any given rule	Number between 1 and 5. 3 by default
Enable automatic minimum support setup	Minimum support is the number of times a rule must match in a training set to be generated. Switched on by default
Custom minimum support threshold	Enabled only if the previous parameter is switched off. Number greater than 2. 5 by default
Enable automatic minimum support setup	Minimum support is the number of times a rule must match in a training set to be generated. Switched on by default
Custom minimum confidence to explore a rule	Enabled only if the previous parameter is switched off. Number between 0.001 and 0.2. 0.05 by default
Minimum acceptance confidence threshold	Number between 0.2 and 0.95. 0.6 by default
Minimum confidence improvement for adding a new condition to a rule	Number between 0.01 and 0.2. 0.01 by default
Enable concatenation of contiguous extractions	Switched off by default
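
The support and confidence thresholds above can be pictured with the following sketch. Support follows the definition given in the table (the number of times a rule matches in the training set). Confidence is not defined on this page, so the sketch assumes it is the fraction of matches that agree with the annotations. The function names and the input format are hypothetical.

```python
def rule_stats(matches):
    """matches: one boolean per rule match, True if it agrees with an annotation."""
    support = len(matches)
    confidence = sum(matches) / support if support else 0.0
    return support, confidence


def accept_rule(matches, min_support=5, min_confidence=0.6):
    """Keep a rule only if it clears both thresholds (defaults from the table)."""
    support, confidence = rule_stats(matches)
    return support >= min_support and confidence >= min_confidence


# A hypothetical rule that matched 5 times, 4 of them correctly.
print(rule_stats([True, True, True, True, False]))
print(accept_rule([True, True, True, True, False]))
```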

Feature options

Parameter	Description
Window size (in tokens) to the left of the token being predicted	Number between 0 and 5. 3 by default
Window size (in tokens) to the right of the token being predicted	Number between 0 and 5. 3 by default
Minimum document frequency	Number greater than 1. 2 by default
Raw word form	The word itself. On by default
Word base form (Lemma)	Base form of a word (lemma) (e.g. “run” for “running” or “ran”). On by default
Word Part-of-Speech	Part-of-speech of a word (e.g. noun, verb, etc.). On by default
Syncons	Conceptual meaning of a word or phrase (e.g. “to work out” means “to exercise”). On by default
Ancestors	More abstract concepts related to syncons. On by default
Numeric words	Toggle numeric as a feature in the word vector. Off by default
Use suffix of a word	Off by default
Use prefix of a word	Off by default
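
The two window size parameters determine how many neighbouring tokens contribute features for the token being predicted. A minimal sketch, assuming a simple whitespace tokenization and a hypothetical helper name:

```python
def context_window(tokens, index, left=3, right=3):
    """Return the tokens to the left and right of tokens[index]."""
    start = max(0, index - left)  # do not run past the start of the text
    left_context = tokens[start:index]
    right_context = tokens[index + 1:index + 1 + right]
    return left_context, right_context


tokens = "The quick brown fox jumps over the lazy dog".split()
print(context_window(tokens, 4))
```

A window size of 0 on either side means that no tokens from that side are considered.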

Rules selection

Options for selecting best rules

Parameter	Description
Fine-tuning rules selecting only the most significant ones	On by default
Number of rules selection steps	Number between 20 and 100. 50 by default
Fraction of validation split	Number between 0.1 and 0.9. 0.2 by default
Activate rules pruning	Off by default
Max number of rules to select	Enabled only if the previous parameter is switched off. Number between 1 and 1000. 100 by default