Experiments engine setup parameters
The parameters used when starting a categorization experiment or an extraction experiment are listed in the following tables.
Advanced parameters are included; they are marked with a blue caption in italics. To hide them, select Hide advanced parameters.
Categorization
Auto-ML categorization
Training docs
- In Training library, select an annotated library to use for the experiment.
- Select the Training documents selection policy among:
    - Only validated documents (strict)
    - All annotated documents (Selected by default)
Model type
Parameter | Description |
---|---|
Linear SVM | Linear SVM classifier model: standard Support Vector Machine using a linear, maximum-margin decision boundary |
Probabilistic SVM | Probabilistic SVM classifier model: Support Vector Machine using probability distribution prediction scores |
Custom kernel SVM | Custom kernel SVM classifier model: Support Vector Machine using custom kernel |
SGD | SGD classifier model: Stochastic Gradient Descent learning mechanism on a Linear SVM model |
GBoost | GBoost classifier model: Gradient boosting technique of stacking decision tree models, sequentially training on residual errors |
XGBoost | XGBoost classifier model: Extreme gradient boosting technique using more accurate approximations over a GBoost model |
Random Forest | Random Forest classifier model: ensemble of decision trees using combined majority predictions |
Logistic Regression | Logistic Regression classifier model: logistic function used to model probabilities of possible outcomes |
Multinomial Naive Bayes | Multinomial Naïve Bayes classifier model: standard Naïve Bayes model using conditional probability of words to determine predictions |
Complement Naive Bayes | Complement Naïve Bayes classifier model: multinomial Naïve Bayes model improved by using statistics from the complement of each class to compute model weights |
Note
If you select one of the following models:
- Probabilistic SVM
- GBoost
- XGBoost
- Random Forest
- Logistic Regression
- Multinomial Naive Bayes
- Complement Naive Bayes
the Auto ML parameters and the F-Beta parameters won't be available.
If you select the Custom kernel SVM, the Auto ML parameters won't be available.
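The availability rules in this note can be summarized as a lookup table. The following is a minimal sketch: the names `UNAVAILABLE_GROUPS` and `unavailable_groups` are hypothetical and not part of any platform API, and models that the note does not mention (Linear SVM, SGD) are assumed to have no restrictions.

```python
# Hypothetical lookup built only from the note above.
# Keys are model types; values are the parameter groups the note says are unavailable.
UNAVAILABLE_GROUPS = {
    "Probabilistic SVM": {"Auto ML", "F-Beta"},
    "GBoost": {"Auto ML", "F-Beta"},
    "XGBoost": {"Auto ML", "F-Beta"},
    "Random Forest": {"Auto ML", "F-Beta"},
    "Logistic Regression": {"Auto ML", "F-Beta"},
    "Multinomial Naive Bayes": {"Auto ML", "F-Beta"},
    "Complement Naive Bayes": {"Auto ML", "F-Beta"},
    "Custom kernel SVM": {"Auto ML"},
}


def unavailable_groups(model_type):
    """Return the sorted parameter groups that are unavailable for a model type.

    Models not listed in the note (assumption: Linear SVM and SGD) return [].
    """
    return sorted(UNAVAILABLE_GROUPS.get(model_type, set()))


print(unavailable_groups("XGBoost"))
print(unavailable_groups("Custom kernel SVM"))
print(unavailable_groups("Linear SVM"))
```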
Auto ML parameters
Parameter | Description |
---|---|
Enable custom setup | Disabled by default. Enable it to set the other parameters |
Number of training iterations for the model | Number between 20 and 100 (30 by default) |
Number of data splits for cross-validation | Number between 2 and 10 (3 by default) |
Call back function for stopping the self-tuning process | You can select: - Stop based on best scoring evaluation - Stop based on total time - Stop based on both best scoring evaluation and total time (by default) |
Target time (in minutes) deadline for the call back stop function | Number between 15 and 120 (30 by default) |
Note
If you select Stop based on best scoring evaluation for the Call back function for stopping the self-tuning process parameter, the Target time (in minutes) deadline for the call back stop function parameter won't be available.
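The ranges and the dependency described above can be checked before starting an experiment. The sketch below is illustrative only: the setting names (`training_iterations`, `cv_splits`, `target_time_minutes`, `stop_policy`) are hypothetical, and the documented ranges are assumed to be inclusive.

```python
# (min, max, default) taken from the Auto ML parameters table; bounds assumed inclusive.
RANGES = {
    "training_iterations": (20, 100, 30),
    "cv_splits": (2, 10, 3),
    "target_time_minutes": (15, 120, 30),
}
# Hypothetical identifiers for the three stop options; "both" is the default.
STOP_POLICIES = ("best_score", "total_time", "both")


def validate_auto_ml(settings):
    """Return a list of problems found in a dict of hypothetical settings."""
    problems = []
    policy = settings.get("stop_policy", "both")
    if policy not in STOP_POLICIES:
        problems.append(f"unknown stop policy: {policy}")
    for name, (low, high, default) in RANGES.items():
        # The target time is not available when stopping on best score only.
        if name == "target_time_minutes" and policy == "best_score":
            continue
        value = settings.get(name, default)
        if not low <= value <= high:
            problems.append(f"{name}={value} is outside {low}-{high}")
    return problems


print(validate_auto_ml({"cv_splits": 12}))
```

An empty list means that every value falls inside its documented range.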
Feature space
Data elements to use in feature vector creation.
Parameter | Description |
---|---|
Word form: Occurrence of a keyword | Set on Use by default |
Word base form (Lemma): Base form of a word (lemma) (e.g. “run” for “running” or “ran”) | Set on Use by default |
Main lemma: Document-level most representative lemmas | Set on Don't use by default |
Word base form stem: Stem of a word (e.g. “intern” is the stem of “international”) | Set on Use by default |
Sub-words: Unit smaller than a word (e.g. morphemes, stems and endings, roots, etc.) | Set on Use by default |
Entities: Entities (e.g. persons, organizations, etc.) | Set on Use by default |
Syncons: Conceptual meaning of a word or phrase (e.g. “to work out” means “to exercise”) | Set on Use by default |
Main Syncons: Document-level most representative syncons | Set on Use by default |
Syncon Topics: Generalized main subjects being discussed (e.g. “mammal” as a concept in “the tiger is a mammal” has topic “zoology”) | Set on Use by default |
Main Topics: Document-level most representative topics | Set on Use by default |
Knowledge Label: Pre-defined parent syncon (e.g. “legal action” is the knowledge label for “moratorium”) | Set on Don't use by default |
Knowledge Graph relations: Attribute the hierarchical relation nodes as added meaning to the word (e.g. “dentist” is also “medical specialist”, “doctor”, “professional”, etc.) | Set on Use by default |
Hyperparameters
Parameter | Description |
---|---|
SVM C parameter: penalty for misclassifications | The following values are set by default: - 0.001 - 0.01 - 0.1 - 0.3 - 0.5 - 0.8 - 1 |
Class weight | Balanced by default |
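To give an idea of what a balanced class weight does, the sketch below uses the common inverse-frequency heuristic, `n_samples / (n_classes * class_count)`. This is an assumption about what Balanced means here; the page does not state the formula the engine applies. The category names are made up for illustration.

```python
from collections import Counter

# Candidate values for the SVM C parameter, as listed in the table above.
C_VALUES = [0.001, 0.01, 0.1, 0.3, 0.5, 0.8, 1]


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Assumption: this is the usual "balanced" heuristic, not a confirmed
    description of the engine's behavior.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical training set: 8 documents in one category, 2 in another.
weights = balanced_class_weights(["sport"] * 8 + ["finance"] * 2)
print(weights)
```

With this heuristic the rarer category receives the larger weight, so misclassifying its documents costs more during training.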
F-Beta
Parameter | Description |
---|---|
Enable F Beta optimization (tuning balance between precision and recall) | Disabled by default. Enable it to set the other parameters. |
Target F Beta: A default beta value is 1.0, which is the same as the F-measure. A smaller beta value, such as 0.5, gives more weight to precision and less to recall. A larger beta value, such as 2.0, gives less weight to precision and more weight to recall in the calculation of the score. | Number between 0 and 2 (1 by default) |
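The effect of the beta value can be seen with the standard F-beta formula, `(1 + beta²) · P · R / (beta² · P + R)`. The function below is a minimal sketch of that formula; the precision and recall figures are invented for illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score computed from precision and recall."""
    b2 = beta ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0  # avoid division by zero when there is nothing to score
    return (1 + b2) * precision * recall / denominator


# Illustrative model with high precision (0.8) and lower recall (0.5).
for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_beta(0.8, 0.5, beta), 3))
```

Because precision is higher than recall in this example, the score is highest at beta 0.5 and lowest at beta 2.0.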
Explainable categorization
Training docs
- In Training library, select an annotated library to use for the experiment.
- Select the Training documents selection policy among:
    - Only validated documents (strict)
    - Prefer validated documents (Selected by default)
    - Prefer annotated documents
    - Random selection
Generic parameters
Parameter | Description |
---|---|
Enable "onCategorizer" optimization | Enabled by default |
Enable "strict" hierarchical mode | Disabled by default |
Enable "single label" mode | Disabled by default |
Note
If Enable "onCategorizer" optimization is disabled, the Fine tuning parameters are not available.
Rules Generation
Parameter | Description |
---|---|
Enable generation of syncon based rules | Enabled by default |
Enable generation of ancestor based rules | Disabled by default |
Max number of rules for each taxonomy category | Number between 5 and 1000, 200 by default |
Min number of annotated documents for a category, to enable rules generation | Number between 2 and 1000, 5 by default |
Max number of rules in which any single item can participate | Number between 2 and 200, 40 by default |
Note
If Enable generation of syncon based rules is disabled, Enable generation of ancestor based rules will not be available.
Fine tuning
Parameter | Description |
---|---|
Desired Clean level | Automatic by default |
Default clean level (initial value if auto is selected) | 10 by default |
Desired filter sequence | Automatic by default |
Default filter sequence (initial value if auto is selected) | 40, 80, 90, 90, 90 by default |
Enable conservative clean | Disabled by default |
Max number of documents to be considered by the optimization algorithm (value of -1 is no limit) | Number greater than or equal to -1 (-1 by default) |
F-Beta
Parameter | Description |
---|---|
Target F Beta: A beta value of 1.0 is the same as the F-measure. A smaller beta value, such as 0.5, gives more weight to precision and less to recall. A larger beta value, such as 2.0, gives less weight to precision and more weight to recall in the calculation of the score. | Number between 0 and 2, 0.75 by default |
Bootstrapped Studio Project
Training docs
- In Training library, select an annotated library to use for the experiment.
- Select the Training documents selection policy among:
    - Only validated documents (strict)
    - Prefer validated documents (Selected by default)
    - Prefer annotated documents
    - Random selection
Rules Generation
Parameter | Description |
---|---|
Enable generation of syncon based rules | Enabled by default |
Enable generation of ancestor based rules | Disabled by default |
Max number of items in each rule | Number between 1 and 3, 2 by default |
Max number of rules for each taxonomy category | Number between 5 and 50, 20 by default |
Min number of annotated documents for a category, to enable rules generation | Number between 2 and 100, 5 by default |
Max number of rules in which any single item can participate | Number between 2 and 20, 5 by default |
Max number of elements in a single item of a rule | Number between 1 and 3, 2 by default |
Note
If Enable generation of syncon based rules is disabled, Enable generation of ancestor based rules will not be available.
Fine tuning
Parameter | Description |
---|---|
Desired Clean level | 10 by default |
Desired filter sequence | 40, 80, 90 by default |
Extraction
Explainable Extraction
Training docs
- In Training library, select an annotated library to use for the experiment.
- Select the Training documents selection policy among:
    - Only validated documents (strict)
    - Prefer validated documents (Selected by default)
    - Prefer annotated documents
    - Random selection
Support, confidences and tolerance parameters
Parameter | Description |
---|---|
Max rule length | Number between 1 and 10. 3 by default |
Confidence threshold for accepting a rule | Number between 0.001 and 1. 0.8 by default |
Tolerance | Number between 0.0001 and 0.1. 0.02 by default |
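This page does not define how rule confidence is computed. As a rough illustration only, the sketch below assumes the usual rule-mining definition: the share of a candidate rule's matches that agree with the annotations. The function name and its arguments are hypothetical, and the Tolerance parameter is left out because its meaning is not described here.

```python
def accept_rule(correct_matches, total_matches, threshold=0.8):
    """Accept a candidate rule when its confidence reaches the threshold.

    Assumption: confidence = correct_matches / total_matches. The engine's
    actual definition is not documented on this page.
    """
    if total_matches == 0:
        return False  # a rule that never matches has no measurable confidence
    return correct_matches / total_matches >= threshold


print(accept_rule(9, 10))  # confidence 0.9 with the default threshold 0.8
print(accept_rule(7, 10))  # confidence 0.7 with the default threshold 0.8
```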
Active feature options
Parameter | Description |
---|---|
Left context window size | Number between 1 and 10. 3 by default |
Right context window size | Number between 1 and 10. 3 by default |
Use token raw form | On by default |
Use lemma | On by default |
Use POS type | On by default |
Use syncon | On by default |
Use ancestor | On by default |
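The left and right context window sizes set how many neighboring tokens are considered around a token. The sketch below shows the idea on a whitespace-tokenized sentence; the function name and the tokenization are assumptions made for illustration, not the engine's implementation.

```python
def context_window(tokens, index, left=3, right=3):
    """Return the tokens to the left and to the right of tokens[index]."""
    start = max(0, index - left)  # do not run past the start of the text
    left_context = tokens[start:index]
    right_context = tokens[index + 1:index + 1 + right]
    return left_context, right_context


# Illustrative sentence; "Rome" is at position 8.
tokens = "The contract was signed by Acme Corp in Rome last May".split()
print(context_window(tokens, 8))
```

Near the start or the end of a text, the window is simply shorter than the configured size.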
Auto-ML Extraction
Training docs
- In Training library, select an annotated library to use for the experiment.
- Select the Training documents selection policy among:
    - Only validated documents (strict)
    - Prefer validated documents (Selected by default)
    - Prefer annotated documents
    - Random selection
Model type
Parameter | Description |
---|---|
CRF model | CRF entity extraction model: Conditional Random Fields probabilistic model designed for sequence labeling |
SVM sliding window | Support Vector Machine using a sequence tagging approach translated into a local linear SVC classifier |
Note
If you select the SVM sliding window model, the Auto-ML parameters, the Hyperparameters and the F-Beta parameters won't be available.
Auto-ML parameters
Parameter | Description |
---|---|
Enable custom setup | Disabled by default. Enable it to set the other parameters |
Number of training iterations for the model | Number between 20 and 100 (30 by default) |
Number of data splits for cross-validation | Number between 2 and 10 (3 by default) |
Call back function for stopping the self-tuning process | You can select: - Stop based on best scoring evaluation - Stop based on total time - Stop based on both best scoring evaluation and total time (by default) |
Target time (in minutes) deadline for the call back stop function | Number between 15 and 120 (30 by default) |
Note
If you select Stop based on best scoring evaluation, the Target time (in minutes) deadline for the call back stop function parameter won't be available.
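The number of data splits controls how many train/validation partitions are evaluated during cross-validation. The sketch below shows a plain k-fold partition over document indices. It is an assumption about the general technique; the engine may shuffle or stratify documents differently.

```python
def k_fold_indices(n_documents, n_splits=3):
    """Yield (train, validation) index lists, one pair per fold."""
    if not 2 <= n_splits <= 10:
        raise ValueError("n_splits must be between 2 and 10")
    # Assign documents to folds in round-robin order.
    folds = [list(range(i, n_documents, n_splits)) for i in range(n_splits)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


for train, validation in k_fold_indices(7):
    print("train:", train, "validation:", validation)
```

Each document is used for validation exactly once and for training in all the other folds.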
Feature space
Parameter | Description |
---|---|
Word base form (Lemma): Base form of a word (lemma) (e.g. “run” for “running” or “ran”) | Set on Don't use by default |
Logic dependencies: Relationships and dependencies of the word (e.g. “subject” – “relationship type” – “object” relationships) | Set on Use by default |
Word Part-of-Speech: Part-of-speech of a word (e.g. noun, verb, etc.) | Set on Use by default |
Collocations: Combination of words frequently used together which have a specific meaning (e.g. “regular exercise” or “to take a risk”) | Set on Use by default |
Phrases: Combination of words that together create a singular meaning (e.g. “to look after” or “on the table”) | Set on Don't use by default |
Syncons: Conceptual meaning of a word or phrase (e.g. “to work out” means “to exercise”) | Set on Use by default |
Known Concepts: Specific meaning of a term that is available in the Expert.ai Knowledge Graph (e.g. “Italy” is a specific country, “World Cup” is a specific football tournament) | Set on Don't use by default |
Entities: Entities (e.g. persons, organizations, etc.) | Set on Use by default |
Knowledge Graph relations: Attribute the hierarchical relation nodes as added meaning to the word (e.g. “dentist” is also “medical specialist”, “doctor”, “professional”, etc.) | Set on Don't use by default |
Title case words: Toggle title case as a feature in the word vector | Set on Use by default |
Upper case words: Toggle upper case as a feature in the word vector | Set on Use by default |
Digit words: Toggle digit as a feature in the word vector | Set on Use by default |
Mixed case words: Toggle mixed case as a feature in the word vector | Set on Use by default |
Alpha Numeric words: Toggle alpha numeric as a feature in the word vector | Set on Use by default |
Alphabetic words: Toggle alphabetic as a feature in the word vector | Set on Use by default |
Numeric words: Toggle numeric as a feature in the word vector | Set on Use by default |
Decimal number words: Toggle decimal number as a feature in the word vector | Set on Use by default |
Hyperparameters
Parameter | Description |
---|---|
CRF c1 regularization coefficient | The following values are set by default: - 0 - 0.001 - 0.01 - 0.1 - 0.5 - 1 |
CRF c2 regularization coefficient | The following values are set by default: - 0.001 - 0.01 - 0.1 - 0.5 - 1 |
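In common CRF implementations such as CRFsuite, c1 and c2 are the coefficients for L1 and L2 regularization. The sketch below enumerates every pairing of the default values listed above. Whether the self-tuning process evaluates all pairs or only a sample of them is not stated on this page, so the exhaustive listing is an assumption.

```python
from itertools import product

# Default candidate values from the Hyperparameters table.
C1_VALUES = [0, 0.001, 0.01, 0.1, 0.5, 1]
C2_VALUES = [0.001, 0.01, 0.1, 0.5, 1]


def candidate_settings():
    """Return every (c1, c2) pairing as a list of dicts."""
    return [{"c1": c1, "c2": c2} for c1, c2 in product(C1_VALUES, C2_VALUES)]


settings = candidate_settings()
print(len(settings))  # 6 values of c1 times 5 values of c2
print(settings[0])
```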
F-Beta
Parameter | Description |
---|---|
Enable F Beta optimization (tuning balance between precision and recall) | Disabled by default. Enable it to set the parameter below |
Target F Beta: A default beta value is 1.0, which is the same as the F-measure. A smaller beta value, such as 0.5, gives more weight to precision and less to recall. A larger beta value, such as 2.0, gives less weight to precision and more weight to recall in the calculation of the score. | Number between 0 and 5 (1 by default) |