Experiments engine setup parameters
The parameters used when starting a categorization experiment or an extraction experiment are listed in the following tables.
The advanced parameters are included and are marked with a blue caption in italics. To hide them, select Hide advanced parameters.
Categorization
Auto-ML categorization
Training docs
- In Training library, select the annotated library to use for the experiment.
- Select the Training documents selection policy among:
  - Only validated annotated documents (strict)
  - Only validated or annotated documents (strict) (Selected by default)
  - Prefer validated documents
  - Prefer annotated documents
  - Random selection
Model type
Parameter | Description |
---|---|
Linear SVM | Linear SVM classifier model: standard Support Vector Machine using linear regression margins |
Probabilistic SVM | Probabilistic SVM classifier model: Support Vector Machine using probability distribution prediction scores |
Custom kernel SVM | Custom kernel SVM classifier model: Support Vector Machine using custom kernel |
SGD | SGD classifier model: Stochastic Gradient Descent learning mechanism on a Linear SVM model |
GBoost | GBoost classifier model: Gradient boosting technique of stacking decision tree models, sequentially training on residual errors |
XGBoost | XGBoost classifier model: Extreme gradient boosting technique using more accurate approximations over a GBoost model |
Random Forest | Random Forest classifier model: ensemble of decision trees using combined majority predictions |
Logistic Regression | Logistic Regression classifier model: logistic function used to model probabilities of possible outcomes |
Multinomial Naive Bayes | Multinomial Naïve Bayes classifier model: standard Naïve Bayes model using conditional probability of words to determine predictions |
Complement Naive Bayes | Complement Naïve Bayes classifier model: multinomial Naïve Bayes model improved by using statistics from the complement of each class to compute model weights |
Warning
- If you select one of the following models, the Auto ML parameters and the F-Beta parameters won't be available:
  - Probabilistic SVM
  - GBoost
  - XGBoost
  - Random Forest
  - Logistic Regression
  - Multinomial Naive Bayes
  - Complement Naive Bayes
- If you select the Custom kernel SVM model, the Auto ML parameters won't be available.
- If you select more than one model type, the Feature space parameters, Hyperparameters, F-Beta parameters and Auto ML parameters won't be available.
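The availability rules in the warning above can be summarized as a small lookup. This is only an illustrative sketch: the function name `unavailable_groups` and the group labels are hypothetical and not part of any Platform API, and the code encodes nothing beyond the rules stated in the warning.

```python
# Hypothetical helper: encodes only the rules stated in the warning above.
NO_AUTOML_NO_FBETA = {
    "Probabilistic SVM", "GBoost", "XGBoost", "Random Forest",
    "Logistic Regression", "Multinomial Naive Bayes", "Complement Naive Bayes",
}


def unavailable_groups(selected_models):
    """Return the parameter groups that are unavailable for a model selection."""
    models = set(selected_models)
    if len(models) > 1:
        # Selecting several model types disables all four groups.
        return {"Feature space", "Hyperparameters", "F-Beta", "Auto ML"}
    unavailable = set()
    if models & NO_AUTOML_NO_FBETA:
        unavailable |= {"Auto ML", "F-Beta"}
    if "Custom kernel SVM" in models:
        unavailable.add("Auto ML")
    return unavailable
```

For example, `unavailable_groups(["Linear SVM"])` returns an empty set, while selecting both Linear SVM and SGD returns all four groups.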
Problem definition
Parameter | Description |
---|---|
Enable strict "single label" mode | Enable strict single label mode (off by default) |
Enable strict "Sub document categorization" compatibility mode | Lets the trained model use annotated strings of text to predict classes ("sub document categorization"). Off by default. Currently, annotation and testing must be done outside the Platform. |
Warning
If you switch on the Enable strict "single label" mode parameter, the F-Beta parameters will not be available.
Feature space
Switch Automatic features selection on or off to enable or disable the automatic selection of the best parameter combination by the Platform.
If Automatic features selection is switched off, set each of the following parameters to Use or Don't use:
Parameter | Description |
---|---|
Word form | Occurrence of a keyword. Set to Use by default. |
Word base form (Lemma) | Base form of a word (lemma) (e.g. “run” for “running” or “ran”). Set to Use by default. |
Main lemma | Document-level most representative lemmas. Set to Don't use by default. |
Word base form stem | Stem of a word (e.g. “intern” is the stem of “international”). Set to Use by default. |
Sub-words | Unit smaller than a word (e.g. morphemes, stems and endings, roots, etc.). Set to Use by default. |
Entities | Entities (e.g. persons, organizations, etc.). Set to Use by default. |
Syncons | Conceptual meaning of a word or phrase (e.g. “to work out” means “to exercise”). Set to Use by default. |
Main Syncons | Document-level most representative syncons. Set to Use by default. |
Syncon Topics | Generalized main subjects being discussed (e.g. “mammal” as a concept in “the tiger is a mammal” has topic “zoology”). Set to Use by default. |
Main Topics | Document-level most representative topics. Set to Use by default. |
Knowledge Label | Pre-defined parent syncon (e.g. “legal action” is the knowledge label for “moratorium”). Set to Don't use by default. |
Knowledge Graph relations | Adds the hierarchical relation nodes as extra meaning for the word (e.g. “dentist” is also “medical specialist”, “doctor”, “professional”, etc.). Set to Use by default. |
Use word embeddings | Set to Don't use by default. |
Hyperparameters
Select Activate Auto-ML on every parameter to enable automatic parameter configuration. Deselected by default.
If Activate Auto-ML on every parameter is off:
- Select Enable strict "Sub document categorization" compatibility mode to employ the trained model in "Sub document categorization" mode (by exporting it). Off by default.
- Select Enable strict "single label" mode to activate the single label mode. Off by default.
Parameter | Allowed values |
---|---|
SVM C parameter: penalty for misclassifications | 0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1 (by default), 5 |
Class weight | Balanced (by default), None |
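This page does not state how the Balanced class weight is computed. A common convention, used for instance by scikit-learn's `class_weight="balanced"` option, weights each class inversely to its frequency as `n_samples / (n_classes * class_count)`. The sketch below assumes that convention; the Platform may compute the weights differently, and the function name is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency (assumed convention)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# 8 "sports" documents and 2 "finance" documents:
# sports  -> 10 / (2 * 8) = 0.625
# finance -> 10 / (2 * 2) = 2.5
weights = balanced_class_weights(["sports"] * 8 + ["finance"] * 2)
```

Under this convention the rarer class gets the larger weight, so its misclassifications are penalized more heavily during training. With None, every class would carry the same weight.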
F-Beta
Parameter | Description |
---|---|
Enable F Beta optimization (tuning balance between precision and recall) | Disabled by default. Enable it to set the other parameters. |
Target F Beta | Beta sets the balance between precision and recall in the score. A value of 1.0 gives the standard F-measure (F1). A smaller value, such as 0.5, gives more weight to precision and less to recall. A larger value, such as 2.0, gives more weight to recall and less to precision. Number between 0 and 2 (1 by default) |
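To make the effect of beta concrete, the standard F-beta score is F_beta = (1 + beta²) · P · R / (beta² · P + R), where P is precision and R is recall. The following minimal sketch computes it for a model with high precision and lower recall; the function name is illustrative, and how the Platform uses the target value internally is not described here.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score; returns 0.0 when the denominator is zero."""
    b2 = beta ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator


precision, recall = 0.9, 0.6
f_half = f_beta(precision, recall, beta=0.5)  # about 0.818, favors precision
f_one = f_beta(precision, recall, beta=1.0)   # 0.72, the usual F1
f_two = f_beta(precision, recall, beta=2.0)   # about 0.643, favors recall
```

Because this model is stronger on precision than on recall, its score rises as beta decreases and falls as beta increases.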
Auto ML parameters
Machine Learning automatic self-tuning process parameters
AutoML is an algorithm that runs separately from the model being trained. AutoML selects the best combination of Features and Model Parameters to return the best performing model. The parameters on this screen pertain only to the AutoML algorithm.
Parameter | Description |
---|---|
Number of training iterations for the AutoML algorithm | Number between 20 and 100 (30 by default) |
Number of data splits for cross-validation of AutoML algorithm | Number between 2 and 10 (3 by default) |
Call back function for stopping the AutoML self-tuning process | You can select one of: Stop based on best scoring evaluation; Stop based on total time; Stop based on both best scoring evaluation and total time (by default) |
Target time deadline for the AutoML call back stop function (minutes) | Number between 15 and 120 (30 by default) |
Note
If you select Stop based on best scoring evaluation from the Call back function for stopping the AutoML self-tuning process parameter, the Target time deadline for the AutoML call back stop function (minutes) parameter won't be available.
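The three stopping options can be pictured as a callback that is consulted after each AutoML iteration. The sketch below is an assumption-heavy illustration, not the Platform's implementation: the mode labels, the `patience` idea (stop when the best score has not improved for a number of iterations) and the choice to stop as soon as either criterion fires in the combined mode are all hypothetical. Only the 15 to 120 minute deadline range comes from the table above.

```python
import time


def make_stop_callback(mode, deadline_minutes=30, patience=5, clock=time.monotonic):
    """Build a stop function; mode is "score", "time" or "both" (hypothetical labels)."""
    if mode not in {"score", "time", "both"}:
        raise ValueError(f"unknown mode: {mode}")
    # The deadline only matters when time is part of the stopping criterion.
    if mode != "score" and not 15 <= deadline_minutes <= 120:
        raise ValueError("deadline_minutes must be between 15 and 120")
    start = clock()
    state = {"best": float("-inf"), "stale": 0}

    def should_stop(score):
        # Track how many iterations have passed without a new best score.
        if score > state["best"]:
            state["best"], state["stale"] = score, 0
        else:
            state["stale"] += 1
        score_stop = state["stale"] >= patience
        time_stop = (clock() - start) >= deadline_minutes * 60
        if mode == "score":
            return score_stop
        if mode == "time":
            return time_stop
        return score_stop or time_stop

    return should_stop
```

In the score-only mode the deadline is never consulted, which mirrors the note above that the time deadline parameter is unavailable for that option.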
Explainable categorization
Training docs
- In Training library, select the annotated library to use for the experiment.
- Select the Training documents selection policy among:
  - Only validated annotated documents (strict)
  - Only validated or annotated documents (strict) (Selected by default)
  - Prefer validated documents
  - Prefer annotated documents
  - Random selection
Generic parameters
Parameter | Description |
---|---|
Enable "onCategorizer" optimization | Enabled by default |
Enable "strict" hierarchical mode | Disabled by default |
Enable "single label" mode | Disabled by default |
Note
If Enable "onCategorizer" optimization is disabled, the Fine tuning parameters are not available.
Rules Generation
Parameter | Description |
---|---|
Enable generation of syncon based rules | Enabled by default |
Enable generation of ancestor based rules | Disabled by default |
Max number of items in each rule | Number between 2 and 4, 3 by default |
Max number of rules for each taxonomy category | Number between 5 and 1000, 200 by default |
Min number of annotated documents for a category, to enable rules generation | Number between 2 and 1000, 5 by default |
Max number of rules in which any single item can participate | Number between 2 and 200, 40 by default |
Note
If Enable generation of syncon based rules is disabled, Enable generation of ancestor based rules will not be available.
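The ranges, defaults and dependency listed for Rules Generation can be captured in a small validation sketch. The class and field names below are hypothetical (the Platform exposes these settings through its interface, not through this code), and the ranges are assumed to be inclusive.

```python
from dataclasses import dataclass


@dataclass
class RulesGenerationParams:
    """Hypothetical container for the Explainable categorization rule settings."""
    syncon_rules: bool = True
    ancestor_rules: bool = False
    max_items_per_rule: int = 3
    max_rules_per_category: int = 200
    min_annotated_docs: int = 5
    max_rules_per_item: int = 40

    def validate(self):
        """Return a list of problems; an empty list means the settings are valid."""
        ranges = {
            "max_items_per_rule": (2, 4),
            "max_rules_per_category": (5, 1000),
            "min_annotated_docs": (2, 1000),
            "max_rules_per_item": (2, 200),
        }
        errors = []
        for name, (low, high) in ranges.items():
            value = getattr(self, name)
            if not low <= value <= high:
                errors.append(f"{name}={value} is outside [{low}, {high}]")
        # Ancestor based rules are only available with syncon based rules enabled.
        if self.ancestor_rules and not self.syncon_rules:
            errors.append("ancestor_rules requires syncon_rules")
        return errors
```

Calling `RulesGenerationParams().validate()` with the defaults returns an empty list.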
Fine tuning
Parameter | Description |
---|---|
Desired Clean level | Automatic by default |
Default clean level (initial value if auto is selected) | 10 by default |
Desired filter sequence | Automatic by default |
Default filter sequence (initial value if auto is selected) | 40, 80, 90, 90, 90 by default |
Enable conservative clean | Disabled by default |
Max number of documents to be considered by the optimization algorithm (value of -1 is no limit) | Number greater than or equal to -1 (-1 by default) |
F-Beta
Parameter | Description |
---|---|
Target F Beta | Beta sets the balance between precision and recall in the score. A value of 1.0 gives the standard F-measure (F1). A smaller value, such as 0.5, gives more weight to precision and less to recall. A larger value, such as 2.0, gives more weight to recall and less to precision. Number between 0 and 2, 0.75 by default |
Bootstrapped Studio Project
Training docs
- In Training library, select the annotated library to use for the experiment.
- Select the Training documents selection policy among:
  - Only validated annotated documents (strict)
  - Only validated or annotated documents (strict) (Selected by default)
  - Prefer validated documents
  - Prefer annotated documents
  - Random selection
Rules Generation
Parameter | Description |
---|---|
Enable generation of syncon based rules | Enabled by default |
Enable generation of ancestor based rules | Disabled by default |
Max number of items in each rule | Number between 1 and 3, 2 by default |
Max number of rules for each taxonomy category | Number between 5 and 50, 20 by default |
Min number of annotated documents for a category, to enable rules generation | Number between 2 and 100, 5 by default |
Max number of rules in which any single item can participate | Number between 2 and 20, 5 by default |
Max number of element in a single item of a rule | Number between 1 and 3, 2 by default |
Note
If Enable generation of syncon based rules is disabled, Enable generation of ancestor based rules will not be available.
Fine tuning
Parameter | Description |
---|---|
Desired Clean level | 10 by default |
Desired filter sequence | 40, 80, 90 by default |
Extraction
Auto-ML Extraction
Training docs
- In Training library, select the annotated library to use for the experiment.
- Select the Training documents selection policy among:
  - Only validated annotated documents (strict)
  - Only validated or annotated documents (strict) (Selected by default)
  - Prefer validated documents
  - Prefer annotated documents
  - Random selection
Model type
Parameter | Description |
---|---|
CRF | CRF entity extraction model: Conditional Random Fields probabilistic model designed for sequence labeling |
SVM sliding window | Support Vector Machine using a sequence tagging approach translated into a local linear SVC classifier |
Note
- If you select the SVM sliding window model, the Hyperparameters and the F-Beta parameters won't be available.
- If you select more than one model type, the Feature space parameters, Hyperparameters, F-Beta parameters and Auto ML parameters won't be available.
Feature space
Switch Automatic features selection on or off to enable or disable the automatic selection of the best parameter combination by the Platform.
If Automatic features selection is switched off, set each of the following parameters to Use or Don't use:
Parameter | Description |
---|---|
Word base form (Lemma) | Base form of a word (lemma) (e.g. “run” for “running” or “ran”). Set to Don't use by default. |
Logic dependencies | Relationships and dependencies of the word (e.g. “subject” – “relationship type” – “object” relationships). Set to Use by default. |
Word Part-of-Speech | Part-of-speech of a word (e.g. noun, verb, etc.). Set to Use by default. |
Collocations | Combination of words frequently used together which have a specific meaning (e.g. “regular exercise” or “to take a risk”). Set to Use by default. |
Phrases | Combination of words that together create a singular meaning (e.g. “to look after” or “on the table”). Set to Don't use by default. |
Syncons | Conceptual meaning of a word or phrase (e.g. “to work out” means “to exercise”). Set to Use by default. |
Known Concepts | Specific meaning of a term that is available in the Expert.ai Knowledge Graph (e.g. “Italy” is a specific country, “World Cup” is a specific football tournament). Set to Don't use by default. |
Entities | Entities (e.g. persons, organizations, etc.). Set to Use by default. |
Knowledge Graph relations | Adds the hierarchical relation nodes as extra meaning for the word (e.g. “dentist” is also “medical specialist”, “doctor”, “professional”, etc.). Set to Don't use by default. |
Title case words | Toggle title case as a feature in the word vector. Set to Use by default. |
Upper case word | Toggle uppercase as a feature in the word vector. Set to Use by default. |
Digit words | Toggle digit as a feature in the word vector. Set to Use by default. |
Mixed case words | Toggle mixed case as a feature in the word vector. Set to Use by default. |
Alpha Numeric words | Toggle alphanumeric as a feature in the word vector. Set to Use by default. |
Alphabetic words | Toggle alphabetic as a feature in the word vector. Set to Use by default. |
Numeric words | Toggle numeric as a feature in the word vector. Set to Use by default. |
Decimal number words | Toggle decimal number as a feature in the word vector. Set to Use by default. |
Hyperparameters
Select Activate Auto-ML on every parameter to enable automatic parameter configuration. Deselected by default.
Parameter | Allowed values |
---|---|
CRF c1 regularization coefficient | 0, 0.00001, 0.0001, 0.001 (by default), 0.01, 0.05, 0.1, 0.3, 0.5, 0.8, 1, 2, 5, 10, 100 |
CRF c2 regularization coefficient | 0.00001, 0.0001, 0.001 (by default), 0.01, 0.05, 0.1, 0.3, 0.5, 2, 5, 10, 100 |
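This page does not define c1 and c2 further. In CRFsuite-style trainers, c1 is the coefficient for L1 regularization and c2 the coefficient for L2 regularization. Assuming the Platform follows that convention, the combined penalty added to the training loss has roughly the form sketched below; constant factors can differ between implementations, and the function name is illustrative.

```python
def regularization_penalty(weights, c1=0.001, c2=0.001):
    """Elastic-net style penalty: c1 * sum(|w|) + c2 * sum(w^2) (assumed form)."""
    l1_term = sum(abs(w) for w in weights)
    l2_term = sum(w * w for w in weights)
    return c1 * l1_term + c2 * l2_term


# With weights [1.0, -2.0]: L1 term = 3.0, L2 term = 5.0
penalty = regularization_penalty([1.0, -2.0], c1=0.1, c2=0.01)  # 0.1 * 3 + 0.01 * 5
```

Under this reading, larger c1 values push more feature weights to exactly zero, while larger c2 values shrink all weights toward zero without removing features.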
F-Beta
Parameter | Description |
---|---|
Enable F Beta optimization (tuning balance between precision and recall) | Disabled by default. Enable it to set the parameter below. |
Target F Beta | Beta sets the balance between precision and recall in the score. A value of 1.0 gives the standard F-measure (F1). A smaller value, such as 0.5, gives more weight to precision and less to recall. A larger value, such as 2.0, gives more weight to recall and less to precision. Number between 0 and 5 (1 by default) |
Auto-ML parameters
Machine Learning automatic self-tuning process parameters
AutoML is an algorithm that runs separately from the model being trained. AutoML selects the best combination of Features and Model Parameters to return the best performing model. The parameters on this screen pertain only to the AutoML algorithm.
Parameter | Description |
---|---|
Number of training iterations for the AutoML algorithm | Number between 20 and 100 (30 by default) |
Number of data splits for cross-validation of AutoML algorithm | Number between 2 and 10 (3 by default) |
Call back function for stopping the AutoML self-tuning process | You can select one of: Stop based on best scoring evaluation; Stop based on total time; Stop based on both best scoring evaluation and total time (by default) |
Target time deadline for the AutoML call back stop function (minutes) | Number between 15 and 120 (30 by default) |
Note
If you select Stop based on best scoring evaluation, the Target time deadline for the AutoML call back stop function (minutes) parameter won't be available.
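The number of data splits controls how the training documents are partitioned for cross-validation: with k splits, each candidate configuration is trained k times, each time holding out a different fold for evaluation. The sketch below shows plain k-fold partitioning over document indices. Whether the Platform shuffles or stratifies the documents is not documented here, so this is an assumption-free baseline rather than its actual procedure, and the function name is hypothetical.

```python
def k_fold_splits(n_documents, n_splits=3):
    """Yield (training_indices, validation_indices) for plain k-fold splitting."""
    if not 2 <= n_splits <= 10:
        raise ValueError("n_splits must be between 2 and 10")
    if n_splits > n_documents:
        raise ValueError("need at least one document per split")
    indices = list(range(n_documents))
    # Distribute documents round-robin so fold sizes differ by at most one.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


splits = list(k_fold_splits(7, n_splits=3))
```

Every document appears in exactly one validation fold, so more splits mean more training runs per iteration and a longer experiment.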
Explainable Extraction
Training docs
- In Training library, select the annotated library to use for the experiment.
- Select the Training documents selection policy among:
  - Only validated annotated documents (strict)
  - Only validated or annotated documents (strict) (Selected by default)
  - Prefer validated documents
  - Prefer annotated documents
  - Random selection
Rules generation
Parameter | Description |
---|---|
Maximum number of conditions for any given rule | Number between 1 and 5. 3 by default |
Enable automatic minimum support setup | Minimum support is the number of times a rule must match in a training set to be generated. Switched on by default |
Custom minimum support threshold | Enabled only if the previous parameter is switched off. Number greater than 2. 5 by default |
Enable automatic minimum support setup | Minimum support is the number of times a rule must match in a training set to be generated. Switched on by default |
Custom minimum confidence to explore a rule | Enabled only if the previous parameter is switched off. Number between 0.001 and 0.2. 0.05 by default |
Minimum acceptance confidence threshold | Number between 0.2 and 0.95. 0.6 by default |
Minimum confidence improvement for adding a new condition to a rule | Number between 0.01 and 0.2. 0.01 by default |
Enable concatenation of contiguous extractions | Switched off by default |
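As a rough illustration of the support and confidence thresholds, the sketch below counts how often a candidate rule matches (its support) and what fraction of those matches are correct (its confidence). The support definition follows the table above; the confidence definition is the usual one from rule mining and is an assumption here, as are all function and variable names.

```python
def rule_stats(rule, examples):
    """Return (support, confidence) of a rule over (features, is_target) pairs."""
    matches = [is_target for features, is_target in examples if rule(features)]
    support = len(matches)
    # Confidence: share of matched tokens that really belong to the target class.
    confidence = sum(matches) / support if support else 0.0
    return support, confidence


def accept_rule(rule, examples, min_support=5, min_confidence=0.6):
    """Keep a rule only if it clears both thresholds."""
    support, confidence = rule_stats(rule, examples)
    return support >= min_support and confidence >= min_confidence


# Toy training set: 6 nouns (4 of them targets) and 4 verbs (no targets).
examples = (
    [({"pos": "NOUN"}, True)] * 4
    + [({"pos": "NOUN"}, False)] * 2
    + [({"pos": "VERB"}, False)] * 4
)
is_noun = lambda features: features["pos"] == "NOUN"
support, confidence = rule_stats(is_noun, examples)  # support 6, confidence 4/6
```

With the defaults shown in the table (support 5, acceptance confidence 0.6) this toy rule would be kept. Raising the acceptance threshold to 0.7 would discard it.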
Feature options
Parameter | Description |
---|---|
Window size (in tokens) to the left of the token being predicted | Number between 0 and 5. 3 by default |
Window size (in tokens) to the right of the token being predicted | Number between 0 and 5. 3 by default |
Minimum document frequency | Number greater than 1. 2 by default |
Raw word form | The word itself. On by default |
Word base form (Lemma) | Base form of a word (lemma) (e.g. “run” for “running” or “ran”). On by default |
Word Part-of-Speech | Part-of-speech of a word (e.g. noun, verb, etc.). On by default |
Syncons | Conceptual meaning of a word or phrase (e.g. “to work out” means “to exercise”). On by default |
Ancestors | More abstract concepts related to syncons. On by default |
Numeric words | Toggle numeric as a feature in the word vector. Off by default |
Use suffix of a word | Off by default |
Use prefix of a word | Off by default |
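The two window size parameters determine how many neighboring tokens contribute features when a token is being predicted. A minimal sketch of that idea, using only the raw word form and hypothetical feature names, could look like this:

```python
def window_features(tokens, index, left=3, right=3):
    """Collect word-form features for tokens[index] and its neighbors."""
    features = {"word[0]": tokens[index]}
    for offset in range(1, left + 1):
        if index - offset >= 0:  # stay inside the sentence on the left
            features[f"word[-{offset}]"] = tokens[index - offset]
    for offset in range(1, right + 1):
        if index + offset < len(tokens):  # stay inside the sentence on the right
            features[f"word[+{offset}]"] = tokens[index + offset]
    return features


tokens = "Acme Corp hired Jane Doe yesterday".split()
context = window_features(tokens, index=3, left=2, right=1)
# {"word[0]": "Jane", "word[-1]": "hired", "word[-2]": "Corp", "word[+1]": "Doe"}
```

A window size of 0 on a side means no context from that side. In a full feature extractor the same windowing would presumably apply to the other enabled features, such as lemmas or parts of speech.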
Rules selection
Options for selecting best rules
Parameter | Description |
---|---|
Fine-tuning rules selecting only the most significant ones | On by default |
Number of rules selection steps | Number between 20 and 100. 50 by default |
Fraction of validation split | Number between 0.1 and 0.9. 0.2 by default |
Activate rules pruning | Off by default |
Max number of rules to select | Enabled only if the previous parameter is switched off. Number between 1 and 1000. 100 by default |