Experiments engine setup parameters

The parameters used when starting a categorization experiment or an extraction experiment are listed in the following tables.

Advanced parameters are included and are marked with a blue caption in italics. To hide them, select Hide advanced parameters.

Categorization

Auto-ML categorization

Training docs

  • Select an annotated library to perform an experiment in Training library.

  • Select the Training documents selection policy among:

    • Only validated annotated documents (strict)
    • Only validated or annotated documents (strict) (Selected by default)
    • Prefer validated documents
    • Prefer annotated documents
    • Random selection

Model type

Parameter Description
Linear SVM Linear SVM classifier model: standard Support Vector Machine using linear regression margins
Probabilistic SVM Probabilistic SVM classifier model: Support Vector Machine using probability distribution prediction scores
Custom kernel SVM Custom kernel SVM classifier model: Support Vector Machine using custom kernel
SGD SGD classifier model: Stochastic Gradient Descent learning mechanism on a Linear SVM model
GBoost GBoost classifier model: Gradient boosting technique of stacking decision tree models, sequentially training on residual errors
XGBoost XGBoost classifier model: Extreme gradient boosting technique using more accurate approximations over a GBoost model
Random Forest Random Forest classifier model: ensemble of decision trees using combined majority predictions
Logistic Regression Logistic Regression classifier model: logistic function used to model probabilities of possible outcomes
Multinomial Naive Bayes Multinomial Naïve Bayes classifier model: standard Naïve Bayes model using conditional probability of words to determine predictions
Complement Naive Bayes Complement Naïve Bayes classifier model: multinomial Naïve Bayes model improved by using statistics from the complement of each class to compute model weights

Warning

  • If you select one of the following models:

    • Probabilistic SVM
    • GBoost
    • XGBoost
    • Random Forest
    • Logistic Regression
    • Multinomial Naive Bayes
    • Complement Naive Bayes

    the Auto ML parameters and the F-Beta parameters won't be available.

  • If you select the Custom kernel SVM model, the Auto ML parameters won't be available.

  • If you select more than one model type, the Feature space parameters, Hyperparameters, F-Beta parameters and Auto ML parameters won't be available.
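
The model type rules in this warning can be sketched as a small lookup. This is only an illustration of the rules as written above: the function and constant names are hypothetical and are not part of any Platform API, and the sketch does not cover the separate "single label" restriction on F-Beta.

```python
# Hypothetical names; the logic mirrors the model type warning, not a Platform API.
NO_AUTOML_NO_FBETA = {
    "Probabilistic SVM", "GBoost", "XGBoost", "Random Forest",
    "Logistic Regression", "Multinomial Naive Bayes", "Complement Naive Bayes",
}
ALL_GROUPS = {"Feature space", "Hyperparameters", "F-Beta", "Auto ML"}


def available_parameter_groups(selected_models):
    """Return the parameter groups that stay configurable for a selection."""
    models = set(selected_models)
    if not models:
        raise ValueError("select at least one model type")
    if len(models) > 1:
        # More than one model type: all four groups become unavailable.
        return set()
    (model,) = models
    groups = set(ALL_GROUPS)
    if model in NO_AUTOML_NO_FBETA:
        groups -= {"Auto ML", "F-Beta"}
    elif model == "Custom kernel SVM":
        groups -= {"Auto ML"}
    return groups
```

For example, `available_parameter_groups(["GBoost"])` leaves only the Feature space and Hyperparameters groups, while selecting Linear SVM or SGD alone keeps all four.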

Problem definition

Parameter Description
Enable strict "single label" mode Enable strict single label mode (off by default)
Enable strict "Sub document categorization" compatibility mode When activated, the trained model can use annotated strings of text to predict classes ("sub document categorization"). Switched off by default. Currently, annotation and testing must be done outside the Platform.

Warning

If you switch on the Enable strict "single label" mode parameter, the F-Beta parameters will not be available.

Feature space

Switch Automatic features selection on or off to enable or disable the automatic selection of the best parameter combination by the Platform.

If Automatic features selection is switched off, select the status, Use or Don't use, of the following parameters:

Parameter Description
Word form Occurrence of a keyword. Set to Use by default.
Word base form (Lemma) Base form of a word (lemma) (e.g. “run” for “running” or “ran”). Set to Use by default.
Main lemma Document-level most representative lemmas. Set to Don't use by default.
Word base form stem Stem of a word (e.g. “intern” is the stem of “international”). Set to Use by default.
Sub-words Unit smaller than a word (e.g. morphemes, stems and endings, roots, etc.). Set to Use by default.
Entities Entities (e.g. persons, organizations, etc.). Set to Use by default.
Syncons Conceptual meaning of a word or phrase (e.g. “to work out” means “to exercise”). Set to Use by default.
Main Syncons Document-level most representative syncons. Set to Use by default.
Syncon Topics Generalized main subjects being discussed (e.g. “mammal” as a concept in “the tiger is a mammal” has topic “zoology”). Set to Use by default.
Main Topics Document-level most representative topics. Set to Use by default.
Knowledge Label Pre-defined parent syncon (e.g. “legal action” is the knowledge label for “moratorium”). Set to Don't use by default.
Knowledge Graph relations Attribute the hierarchical relation nodes as added meaning to the word (e.g. “dentist” is also “medical specialist”, “doctor”, “professional”, etc.). Set to Use by default.
Use word embeddings Set to Don't use by default.

Hyperparameters

Select Activate Auto-ML on every parameter to enable automatic parameter configuration. Deselected by default.

If Activate Auto-ML on every parameter is off:

  • Select Enable strict "Sub document categorization" compatibility mode to employ the trained model in "Sub document categorization" mode (by exporting it). Off by default.
  • Select Enable strict "single label" mode to activate the single label mode. Off by default.
  • SVM C parameter: penalty for misclassifications. Available values: 0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1 (by default), 5.
  • Class weight. Available values: Balanced (by default), None.
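
As an illustration of what a balanced class weight usually means, the sketch below applies the common convention of weighting each class by n_samples / (n_classes * class_count), the same formula scikit-learn uses for `class_weight="balanced"`. Whether the Platform computes its Balanced option exactly this way is an assumption; the function name is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Assumption: this is the common "balanced" convention, not a
    documented Platform formula.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy training set: 8 documents in one category, 2 in another.
weights = balanced_class_weights(["sports"] * 8 + ["finance"] * 2)
```

With this toy set the rare category receives the larger weight, so misclassifying its documents is penalized more heavily. With Class weight set to None, every class counts the same.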

F-Beta

Parameter Description
Enable F Beta optimization (tuning balance between precision and recall) Disabled by default. Enable it to set the other parameters.
Target F Beta A beta value of 1.0 gives the standard F-measure (F1), which weights precision and recall equally. A smaller beta value, such as 0.5, gives more weight to precision and less to recall. A larger beta value, such as 2.0, gives more weight to recall and less to precision. Number between 0 and 2 (1 by default).
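
The effect of the Target F Beta value can be checked with the standard F-beta formula, (1 + beta²) · precision · recall / (beta² · precision + recall). The sketch below is a plain implementation of that textbook formula; the function name is illustrative and the sample precision and recall values are made up.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score computed from precision and recall."""
    if beta < 0:
        raise ValueError("beta must not be negative")
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator


# A model with high precision and lower recall (illustrative numbers).
score_precision_oriented = f_beta(0.9, 0.6, beta=0.5)
score_balanced = f_beta(0.9, 0.6, beta=1.0)
score_recall_oriented = f_beta(0.9, 0.6, beta=2.0)
```

For this high-precision model the score is highest at beta 0.5 and lowest at beta 2.0, which is why a smaller target favors precise models and a larger target favors models with better recall.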

Auto ML parameters

Machine Learning automatic self-tuning process parameters

AutoML is an algorithm that runs separately from the model being trained. AutoML selects the best combination of Features and Model Parameters to return the best performing model. The parameters on this screen pertain only to the AutoML algorithm.

Parameter Description
Number of training iterations for the AutoML algorithm Number between 20 and 100 (30 by default)
Number of data splits for cross-validation of AutoML algorithm Number between 2 and 10 (3 by default)
Call back function for stopping the AutoML self-tuning process You can select:
- Stop based on best scoring evaluation
- Stop based on total time
- Stop based on both best scoring evaluation and total time (by default)
Target time deadline for the AutoML call back stop function (minutes) Number between 15 and 120 (30 by default)

Note

If you select Stop based on best scoring evaluation from the Call back function for stopping the AutoML self-tuning process parameter, the Target time deadline for the AutoML call back stop function (minutes) parameter won't be available.
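
The interplay between the stop options and the time deadline can be sketched as follows. The ranges and defaults come from the table above; everything else is an assumption made for illustration, in particular the class and function names, the short mode names, and the reading of "best scoring evaluation" as "no improvement for a number of rounds" (the hypothetical `patience` argument).

```python
from dataclasses import dataclass

STOP_MODES = ("score", "time", "both")  # hypothetical short names for the three options


@dataclass
class AutoMLSettings:
    iterations: int = 30        # 20 to 100
    cv_splits: int = 3          # 2 to 10
    stop_mode: str = "both"     # default: score and time
    deadline_minutes: int = 30  # 15 to 120; unused when stop_mode == "score"

    def __post_init__(self):
        if not 20 <= self.iterations <= 100:
            raise ValueError("iterations must be between 20 and 100")
        if not 2 <= self.cv_splits <= 10:
            raise ValueError("cv_splits must be between 2 and 10")
        if self.stop_mode not in STOP_MODES:
            raise ValueError("unknown stop mode")
        if self.stop_mode != "score" and not 15 <= self.deadline_minutes <= 120:
            raise ValueError("deadline_minutes must be between 15 and 120")


def should_stop(settings, elapsed_seconds, rounds_without_improvement, patience=5):
    """Decide whether the tuning loop should stop (assumed semantics)."""
    timed_out = elapsed_seconds >= settings.deadline_minutes * 60
    stalled = rounds_without_improvement >= patience
    if settings.stop_mode == "score":
        return stalled  # the deadline is ignored in this mode
    if settings.stop_mode == "time":
        return timed_out
    return stalled or timed_out
```

In score-only mode the deadline never triggers a stop, which matches the note that the time parameter is unavailable for that option.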

Explainable categorization

Training docs

  • Select an annotated library to perform an experiment in Training library.

  • Select the Training documents selection policy among:

    • Only validated annotated documents (strict)
    • Only validated or annotated documents (strict) (Selected by default)
    • Prefer validated documents
    • Prefer annotated documents
    • Random selection

Generic parameters

Parameter Description
Enable "onCategorizer" optimization Enabled by default
Enable "strict" hierarchical mode Disabled by default
Enable "single label" mode Disabled by default

Note

If Enable "onCategorizer" optimization is disabled, the Fine tuning parameters are not available.

Rules Generation

Parameter Description
Enable generation of syncon based rules Enabled by default
Enable generation of ancestor based rules Disabled by default
Max number of items in each rule Number between 2 and 4, 3 by default
Max number of rules for each taxonomy category Number between 5 and 1000, 200 by default
Min number of annotated documents for a category, to enable rules generation Number between 2 and 1000, 5 by default
Max number of rules in which any single item can participate Number between 2 and 200, 40 by default

Note

If Enable generation of syncon based rules is disabled, Enable generation of ancestor based rules will not be available.

Fine tuning

Parameter Description
Desired Clean level Automatic by default
Default clean level (initial value if auto is selected) 10 by default
Desired filter sequence Automatic by default
Default filter sequence (initial value if auto is selected) 40, 80, 90, 90, 90 by default
Enable conservative clean Disabled by default
Max number of documents to be considered by the optimization algorithm (value of -1 is no limit) Number greater than -1 (set by default)

F-Beta

Parameter Description
Target F Beta A beta value of 1.0 gives the standard F-measure (F1), which weights precision and recall equally. A smaller beta value, such as 0.5, gives more weight to precision and less to recall. A larger beta value, such as 2.0, gives more weight to recall and less to precision. Number between 0 and 2 (0.75 by default).

Bootstrapped Studio Project

Training docs

  • Select an annotated library to perform an experiment in Training library.

  • Select the Training documents selection policy among:

    • Only validated annotated documents (strict)
    • Only validated or annotated documents (strict) (Selected by default)
    • Prefer validated documents
    • Prefer annotated documents
    • Random selection

Rules Generation

Parameter Description
Enable generation of syncon based rules Enabled by default
Enable generation of ancestor based rules Disabled by default
Max number of items in each rule Number between 1 and 3, 2 by default
Max number of rules for each taxonomy category Number between 5 and 50, 20 by default
Min number of annotated documents for a category, to enable rules generation Number between 2 and 100, 5 by default
Max number of rules in which any single item can participate Number between 2 and 20, 5 by default
Max number of elements in a single item of a rule Number between 1 and 3, 2 by default

Note

If Enable generation of syncon based rules is disabled, Enable generation of ancestor based rules will not be available.
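
The Rules Generation ranges differ between Explainable categorization and Bootstrapped Studio Project experiments, so a value that is valid for one may be rejected for the other. The sketch below collects the ranges and defaults from the two tables and validates overrides against them. The dictionary keys and the function name are hypothetical shorthand for the parameter labels, not Platform identifiers.

```python
# (minimum, maximum, default) per parameter, copied from the two Rules Generation tables.
RULES_GENERATION_RANGES = {
    "explainable_categorization": {
        "max_items_per_rule": (2, 4, 3),
        "max_rules_per_category": (5, 1000, 200),
        "min_annotated_docs_per_category": (2, 1000, 5),
        "max_rules_per_item": (2, 200, 40),
    },
    "bootstrapped_studio_project": {
        "max_items_per_rule": (1, 3, 2),
        "max_rules_per_category": (5, 50, 20),
        "min_annotated_docs_per_category": (2, 100, 5),
        "max_rules_per_item": (2, 20, 5),
        "max_elements_per_item": (1, 3, 2),
    },
}


def resolve_rules_settings(experiment, overrides=None):
    """Merge overrides into the defaults, rejecting out-of-range values."""
    ranges = RULES_GENERATION_RANGES[experiment]
    settings = {name: default for name, (_, _, default) in ranges.items()}
    for name, value in (overrides or {}).items():
        if name not in ranges:
            raise KeyError(f"{name} is not a parameter of {experiment}")
        low, high, _ = ranges[name]
        if not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
        settings[name] = value
    return settings
```

For example, 200 rules per category is the default for Explainable categorization but is out of range for a Bootstrapped Studio Project, where the maximum is 50.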

Fine tuning

Parameter Description
Desired Clean level 10 by default
Desired filter sequence 40, 80, 90 by default

Extraction

Auto-ML Extraction

Training docs

  • Select an annotated library to perform an experiment in Training library.

  • Select the Training documents selection policy among:

    • Only validated annotated documents (strict)
    • Only validated or annotated documents (strict) (Selected by default)
    • Prefer validated documents
    • Prefer annotated documents
    • Random selection

Model type

Parameter Description
CRF CRF entity extraction model: Conditional Random Fields probabilistic model designed for sequence labeling
SVM sliding window Support Vector Machine using a sequence tagging approach translated into a local linear SVC classifier

Note

  • If you select the SVM sliding window model, the Hyperparameters and the F-Beta parameters won't be available.

  • If you select more than one model type, the Feature space parameters, Hyperparameters, F-Beta parameters and Auto ML parameters won't be available.

Feature space

Switch Automatic features selection on or off to enable or disable the automatic selection of the best parameter combination by the Platform.

If Automatic features selection is switched off, select the status, Use or Don't use, of the following parameters:

Parameter Description
Word base form (Lemma) Base form of a word (lemma) (e.g. “run” for “running” or “ran”). Set to Don't use by default.
Logic dependencies Relationships and dependencies of the word (e.g. “subject” – “relationship type” – “object” relationships). Set to Use by default.
Word Part-of-Speech Part-of-speech of a word (e.g. noun, verb, etc.). Set to Use by default.
Collocations Combination of words frequently used together which have a specific meaning (e.g. “regular exercise” or “to take a risk”). Set to Use by default.
Phrases Combination of words that together create a singular meaning (e.g. “to look after” or “on the table”). Set to Don't use by default.
Syncons Conceptual meaning of a word or phrase (e.g. “to work out” means “to exercise”). Set to Use by default.
Known Concepts Specific meaning of a term that is available in the Expert.ai Knowledge Graph (e.g. “Italy” is a specific country, “World Cup” is a specific football tournament). Set to Don't use by default.
Entities Entities (e.g. persons, organizations, etc.). Set to Use by default.
Knowledge Graph relations Attribute the hierarchical relation nodes as added meaning to the word (e.g. “dentist” is also “medical specialist”, “doctor”, “professional”, etc.). Set to Don't use by default.
Title case words Toggle title case as a feature in the word vector. Set to Use by default.
Upper case word Toggle uppercase as a feature in the word vector. Set to Use by default.
Digit words Toggle digit as a feature in the word vector. Set to Use by default.
Mixed case words Toggle mixed case as a feature in the word vector. Set to Use by default.
Alpha Numeric words Toggle alpha numeric as a feature in the word vector. Set to Use by default.
Alphabetic words Toggle alphabetic as a feature in the word vector. Set to Use by default.
Numeric words Toggle numeric as a feature in the word vector. Set to Use by default.
Decimal number words Toggle decimal number as a feature in the word vector. Set to Use by default.

Hyperparameters

Select Activate Auto-ML on every parameter to enable automatic parameter configuration. Deselected by default.

CRF c1 regularization coefficient. Available values: 0, 0.00001, 0.0001, 0.001 (by default), 0.01, 0.05, 0.1, 0.3, 0.5, 0.8, 1, 2, 5, 10, 100.

CRF c2 regularization coefficient. Available values: 0.00001, 0.0001, 0.001 (by default), 0.01, 0.05, 0.1, 0.3, 0.5, 2, 5, 10, 100.
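
This page does not define the two coefficients. In CRFsuite-style trainers, c1 is the coefficient of the L1 penalty and c2 the coefficient of the L2 penalty, and the sketch below assumes the Platform follows that convention. The function name is illustrative.

```python
def elastic_net_penalty(weights, c1=0.001, c2=0.001):
    """Regularization term added to the training loss.

    Assumption: c1 scales the L1 norm and c2 the squared L2 norm,
    as in CRFsuite-style CRF trainers.
    """
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return c1 * l1 + c2 * l2_squared


# Toy weight vector: larger coefficients penalize large weights more strongly.
penalty = elastic_net_penalty([1.0, -2.0, 0.0], c1=0.1, c2=0.01)
```

Under this reading, a larger c1 tends to push more feature weights to exactly zero (a sparser model), while a larger c2 shrinks all weights smoothly. Setting c1 to 0 turns the L1 term off.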

F-Beta

Parameter Description
Enable F Beta optimization (tuning balance between precision and recall) Disabled by default. Enable it to set the parameter below.
Target F Beta A beta value of 1.0 gives the standard F-measure (F1), which weights precision and recall equally. A smaller beta value, such as 0.5, gives more weight to precision and less to recall. A larger beta value, such as 2.0, gives more weight to recall and less to precision. Number between 0 and 5 (1 by default).

Auto-ML parameters

Machine Learning automatic self-tuning process parameters

AutoML is an algorithm that runs separately from the model being trained. AutoML selects the best combination of Features and Model Parameters to return the best performing model. The parameters on this screen pertain only to the AutoML algorithm.

Parameter Description
Number of training iterations for the AutoML algorithm Number between 20 and 100 (30 by default)
Number of data splits for cross-validation of AutoML algorithm Number between 2 and 10 (3 by default)
Call back function for stopping the AutoML self-tuning process You can select:
- Stop based on best scoring evaluation
- Stop based on total time
- Stop based on both best scoring evaluation and total time (by default)
Target time deadline for the AutoML call back stop function (minutes) Number between 15 and 120 (30 by default)

Note

If you select Stop based on best scoring evaluation, the Target time deadline for the AutoML call back stop function (minutes) parameter won't be available.
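
To make the number of data splits concrete, the sketch below builds plain k-fold cross-validation indices: each document is used for validation exactly once and for training in the remaining folds. This is the textbook procedure, shown here as an assumption about what a data split means for AutoML; the function name is hypothetical.

```python
def k_fold_indices(n_docs, n_splits=3):
    """Return a list of (train_indices, validation_indices) pairs."""
    if not 2 <= n_splits <= 10:
        raise ValueError("n_splits must be between 2 and 10")
    if n_docs < n_splits:
        raise ValueError("need at least one document per split")
    # Spread the remainder over the first folds so sizes differ by at most one.
    sizes = [n_docs // n_splits + (1 if i < n_docs % n_splits else 0)
             for i in range(n_splits)]
    folds, start = [], 0
    for size in sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_docs) if i < start or i >= stop]
        folds.append((train, validation))
        start = stop
    return folds


folds = k_fold_indices(10, n_splits=3)
```

With 10 documents and 3 splits the validation folds hold 4, 3 and 3 documents. More splits give each candidate configuration more training data per fold, at the cost of more training runs per iteration.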

Explainable Extraction

Training docs

  • Select an annotated library to perform an experiment in Training library.

  • Select the Training documents selection policy among:

    • Only validated annotated documents (strict)
    • Only validated or annotated documents (strict) (Selected by default)
    • Prefer validated documents
    • Prefer annotated documents
    • Random selection

Rules generation

Parameter Description
Maximum number of conditions for any given rule Number between 1 and 5. 3 by default
Enable automatic minimum support setup Minimum support is the number of times a rule must match in a training set to be generated. Switched on by default
Custom minimum support threshold Enabled only if the previous parameter is switched off. Number greater than 2. 5 by default
Enable automatic minimum support setup Minimum support is the number of times a rule must match in a training set to be generated. Switched on by default
Custom minimum confidence to explore a rule Enabled only if the previous parameter is switched off. Number between 0.001 and 0.2. 0.05 by default
Minimum acceptance confidence threshold Number between 0.2 and 0.95. 0.6 by default
Minimum confidence improvement for adding a new condition to a rule Number between 0.01 and 0.2. 0.01 by default
Enable concatenation of contiguous extractions Switched off by default
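
The support and confidence thresholds can be illustrated with a toy rule. Support follows the definition in the table (the number of times a rule matches in the training set). Confidence is assumed here to be the share of those matches that carry the target label, which is the usual rule-mining definition and not something this page states. All names and the toy data are hypothetical.

```python
def rule_stats(rule, target_label, examples):
    """Return (support, confidence) of a rule over labeled token examples."""
    matched_labels = [label for features, label in examples if rule(features)]
    support = len(matched_labels)
    if support == 0:
        return 0, 0.0
    hits = sum(1 for label in matched_labels if label == target_label)
    return support, hits / support


def is_accepted(rule, target_label, examples, min_support=5, min_confidence=0.6):
    """Apply the minimum support and acceptance confidence thresholds."""
    support, confidence = rule_stats(rule, target_label, examples)
    return support >= min_support and confidence >= min_confidence


def follows_mr(features):
    # Toy rule with two conditions: previous token and part of speech.
    return features["prev"] == "Mr." and features["pos"] == "NPR"


examples = (
    [({"prev": "Mr.", "pos": "NPR"}, "PERSON")] * 5
    + [({"prev": "Mr.", "pos": "NPR"}, "O")]
    + [({"prev": "the", "pos": "NOU"}, "O")] * 4
)
support, confidence = rule_stats(follows_mr, "PERSON", examples)
```

The toy rule matches six tokens, five of which are labeled PERSON, so it passes the default thresholds of 5 and 0.6 but would be rejected with a minimum acceptance confidence of 0.9.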

Feature options

Parameter Description
Window size (in tokens) to the left of the token being predicted Number between 0 and 5. 3 by default
Window size (in tokens) to the right of the token being predicted Number between 0 and 5. 3 by default
Minimum document frequency Number greater than 1. 2 by default
Raw word form The word itself. On by default
Word base form (Lemma) Base form of a word (lemma) (e.g. “run” for “running” or “ran”). On by default
Word Part-of-Speech Part-of-speech of a word (e.g. noun, verb, etc.). On by default
Syncons Conceptual meaning of a word or phrase (e.g. “to work out” means “to exercise”). On by default
Ancestors More abstract concepts related to syncons. On by default
Numeric words Toggle numeric as a feature in the word vector. Off by default
Use suffix of a word Off by default
Use prefix of a word Off by default
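
The two window size parameters determine how many neighboring tokens contribute features to the token being predicted. A minimal sketch, assuming each neighbor contributes its raw word form keyed by its relative offset; the function name and the key format are hypothetical.

```python
def window_features(tokens, index, left=3, right=3):
    """Collect word-form features for tokens[index] and its neighbors."""
    features = {}
    for offset in range(-left, right + 1):
        position = index + offset
        if 0 <= position < len(tokens):  # skip positions outside the text
            features[f"word[{offset:+d}]"] = tokens[position]
    return features


tokens = ["Mr.", "Smith", "visited", "Rome"]
features = window_features(tokens, 1, left=1, right=1)
```

Here `features` holds the entries `word[-1]`, `word[+0]` and `word[+1]` for "Mr.", "Smith" and "visited". Setting both window sizes to 0 would leave only the token itself.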

Rules selection

Options for selecting best rules

Parameter Description
Fine-tuning rules selecting only the most significant ones On by default
Number of rules selection steps Number between 20 and 100. 50 by default
Fraction of validation split Number between 0.1 and 0.9. 0.2 by default
Activate rules pruning Off by default
Max number of rules to select Enabled only if the previous parameter is switched off. Number between 1 and 1000. 100 by default
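
A rough sketch of how a validation split and a cap on the number of rules could work together: part of the training documents is held out, candidate rules are scored on that part, and only the best ones are kept. How the Platform actually scores and selects rules is not described on this page, so the functions, the scoring dictionary, and the seed argument are all assumptions.

```python
import random


def split_for_rules_selection(doc_ids, validation_fraction=0.2, seed=0):
    """Hold out a fraction of the documents for validating rules."""
    if not 0.1 <= validation_fraction <= 0.9:
        raise ValueError("validation_fraction must be between 0.1 and 0.9")
    shuffled = list(doc_ids)
    random.Random(seed).shuffle(shuffled)  # deterministic for a given seed
    n_validation = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


def select_top_rules(validation_scores, max_rules=100):
    """Keep at most max_rules rules, best validation score first."""
    ranked = sorted(validation_scores, key=validation_scores.get, reverse=True)
    return ranked[:max_rules]


train_ids, validation_ids = split_for_rules_selection(range(10))
kept = select_top_rules({"rule_a": 0.9, "rule_b": 0.4, "rule_c": 0.7}, max_rules=2)
```

With the default fraction of 0.2, two of the ten documents are held out, and the cap of 2 keeps `rule_a` and `rule_c`.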