
Experiments engine setup parameters

The parameters used when starting a categorization experiment or an extraction experiment are listed in the following tables.

Advanced parameters are included and are marked with a blue caption in italics. To hide them, select Hide advanced parameters.

Categorization

Auto-ML categorization

Training docs

  • Select an annotated library to perform an experiment in Training library.

  • Select the Training documents selection policy among:

    • Only validated documents (strict)
    • All annotated documents (selected by default)

Model type

Parameter Description
Linear SVM Linear SVM classifier model: standard Support Vector Machine using linear regression margins
Probabilistic SVM Probabilistic SVM classifier model: Support Vector Machine using probability distribution prediction scores
Custom kernel SVM Custom kernel SVM classifier model: Support Vector Machine using custom kernel
SGD SGD classifier model: Stochastic Gradient Descent learning mechanism on a Linear SVM model
GBoost GBoost classifier model: Gradient boosting technique of stacking decision tree models, sequentially training on residual errors
XGBoost XGBoost classifier model: Extreme gradient boosting technique using more accurate approximations over a GBoost model
Random Forest Random Forest classifier model: ensemble of decision trees using combined majority predictions
Logistic Regression Logistic Regression classifier model: logistic function used to model probabilities of possible outcomes
Multinomial Naive Bayes Multinomial Naïve Bayes classifier model: standard Naïve Bayes model using conditional probability of words to determine predictions
Complement Naive Bayes Complement Naïve Bayes classifier model: multinomial Naïve Bayes model improved by using statistics from the complement of each class to compute model weights

Note

If you select one of the following models:

  • Probabilistic SVM
  • GBoost
  • XGBoost
  • Random Forest
  • Logistic Regression
  • Multinomial Naive Bayes
  • Complement Naive Bayes

the Auto ML parameters and the F-Beta parameters won't be available.

If you select the Custom kernel SVM, the Auto ML parameters won't be available.
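The availability rules in the note above can be summarized as a small lookup. This is only an illustrative sketch: the function name `unavailable_groups` and the constants are hypothetical and are not part of the product.

```python
# Hypothetical helper that mirrors the note above; not a product API.
NO_AUTO_ML_NO_F_BETA = frozenset({
    "Probabilistic SVM",
    "GBoost",
    "XGBoost",
    "Random Forest",
    "Logistic Regression",
    "Multinomial Naive Bayes",
    "Complement Naive Bayes",
})
NO_AUTO_ML = frozenset({"Custom kernel SVM"})


def unavailable_groups(model_type):
    """Return the parameter groups that are not available for a model type."""
    if model_type in NO_AUTO_ML_NO_F_BETA:
        return frozenset({"Auto ML parameters", "F-Beta"})
    if model_type in NO_AUTO_ML:
        return frozenset({"Auto ML parameters"})
    # Models not named in the note (Linear SVM, SGD) keep every group.
    return frozenset()
```

For example, `unavailable_groups("Linear SVM")` returns an empty set, while `unavailable_groups("Custom kernel SVM")` returns only the Auto ML parameters group.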

Auto ML parameters

Parameter Description
Enable custom setup Disabled by default. Enable it to set the other parameters
Number of training iterations for the model Number between 20 and 100 (30 by default)
Number of data splits for cross-validation Number between 2 and 10 (3 by default)
Call back function for stopping the self-tuning process You can select:
- Stop based on best scoring evaluation
- Stop based on total time
- Stop based on both best scoring evaluation and total time (by default)
Target time (in minutes) deadline for the call back stop function Number between 15 and 120 (30 by default)

Note

If you select Stop based on best scoring evaluation for the Call back function for stopping the self-tuning process parameter, the Target time (in minutes) deadline for the call back stop function parameter won't be available.
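The three stop options can be pictured with the following minimal sketch. The documentation does not state how the engine decides that the best score has been reached, so this example assumes a "no improvement for `patience` iterations" rule. The names `make_stop_callback`, `mode` and `patience` are hypothetical.

```python
import time


def make_stop_callback(mode="both", target_minutes=30, patience=5,
                       clock=time.monotonic):
    """Build a function that receives each iteration's score and returns True to stop.

    mode: "score", "time" or "both" (hypothetical labels for the three options).
    patience: assumed number of iterations without improvement before stopping.
    """
    if mode not in {"score", "time", "both"}:
        raise ValueError("mode must be 'score', 'time' or 'both'")
    start = clock()
    best = float("-inf")
    stale = 0

    def should_stop(score):
        nonlocal best, stale
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
        out_of_time = (clock() - start) >= target_minutes * 60
        no_progress = stale >= patience
        if mode == "score":
            return no_progress  # the target time is ignored in this mode
        if mode == "time":
            return out_of_time
        return no_progress or out_of_time

    return should_stop
```

In the score-only mode the target time is never consulted, which is consistent with that parameter being unavailable when Stop based on best scoring evaluation is selected.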

Feature space

Data elements to use in feature vector creation.

Parameter Description
Word form: Occurrence of a keyword Set on Use by default
Word base form (Lemma): Base form of a word (lemma) (e.g. “run” for “running” or “ran”) Set on Use by default
Main lemma: Document-level most representative lemmas Set on Don't use by default
Word base form stem: Stem of a word (e.g. “intern” is the stem of “international”) Set on Use by default
Sub-words: Unit smaller than a word (e.g. morphemes, stems and endings, roots, etc.) Set on Use by default
Entities: Entities (e.g. persons, organizations, etc.) Set on Use by default
Syncons: Conceptual meaning of a word or phrase (e.g. “to work out” means “to exercise”) Set on Use by default
Main Syncons: Document-level most representative syncons Set on Use by default
Syncon Topics: Generalized main subjects being discussed (e.g. “mammal” as a concept in “the tiger is a mammal” has topic “zoology”) Set on Use by default
Main Topics: Document-level most representative topics Set on Use by default
Knowledge Label: Pre-defined parent syncon (e.g. “legal action” is the knowledge label for “moratorium”) Set on Don't use by default
Knowledge Graph relations: Attribute the hierarchical relation nodes as added meaning to the word (e.g. “dentist” is also “medical specialist”, “doctor”, “professional”, etc.) Set on Use by default

Hyperparameters

Parameter Description
SVM C parameter: penalty for misclassifications The following values are set by default:
- 0.001
- 0.01
- 0.1
- 0.3
- 0.5
- 0.8
- 1
Class weight Balanced by default

F-Beta

Parameter Description
Enable F Beta optimization (tuning balance between precision and recall) Disabled by default. Enable it to set the other parameters.
Target F Beta: A beta value of 1.0 is the same as the standard F-measure (F1). A smaller beta value, such as 0.5, gives more weight to precision and less to recall. A larger beta value, such as 2.0, gives less weight to precision and more weight to recall in the calculation of the score. Number between 0 and 2 (1 by default)
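The effect of the beta value can be checked with the standard F-beta formula, F_beta = (1 + beta²) · P · R / (beta² · P + R), where P is precision and R is recall. The helper below is a minimal sketch of that formula. The function name `f_beta` is illustrative, and the example does not claim to reproduce how the engine computes its scores internally.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score from precision and recall (both in [0, 1])."""
    if beta < 0:
        raise ValueError("beta must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # convention when precision and recall are both zero
    return (1 + beta ** 2) * precision * recall / denominator
```

With precision 0.9 and recall 0.6, beta 0.5 gives about 0.818, beta 1 gives 0.72 and beta 2 gives about 0.643. The score moves toward precision as beta decreases and toward recall as beta increases.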

Explainable categorization

Training docs

  • Select an annotated library to perform an experiment in Training library.

  • Select the Training documents selection policy among:

    • Only validated documents (strict)
    • Prefer validated documents (Selected by default)
    • Prefer annotated documents
    • Random selection

Generic parameters

Parameter Description
Enable "onCategorizer" optimization Enabled by default
Enable "strict" hierarchical mode Disabled by default
Enable "single label" mode Disabled by default

Note

If Enable "onCategorizer" optimization is disabled, the Fine tuning parameters are not available.

Rules Generation

Parameter Description
Enable generation of syncon based rules Enabled by default
Enable generation of ancestor based rules Disabled by default
Max number of rules for each taxonomy category Number between 5 and 1000, 200 by default
Min number of annotated documents for a category, to enable rules generation Number between 2 and 1000, 5 by default
Max number of rules in which any single item can participate Number between 2 and 200, 40 by default

Note

If Enable generation of syncon based rules is disabled, Enable generation of ancestor based rules will not be available.

Fine tuning

Parameter Description
Desired Clean level Automatic by default
Default clean level (initial value if auto is selected) 10 by default
Desired filter sequence Automatic by default
Default filter sequence (initial value if auto is selected) 40, 80, 90, 90, 90 by default
Enable conservative clean Disabled by default
Max number of documents to be considered by the optimization algorithm (value of -1 is no limit) Number equal to or greater than -1 (-1 set by default)

F-Beta

Parameter Description
Target F Beta: A beta value of 1.0 is the same as the standard F-measure (F1). A smaller beta value, such as 0.5, gives more weight to precision and less to recall. A larger beta value, such as 2.0, gives less weight to precision and more weight to recall in the calculation of the score. Number between 0 and 2, 0.75 by default

Bootstrapped Studio Project

Training docs

  • Select an annotated library to perform an experiment in Training library.

  • Select the Training documents selection policy among:

    • Only validated documents (strict)
    • Prefer validated documents (Selected by default)
    • Prefer annotated documents
    • Random selection

Rules Generation

Parameter Description
Enable generation of syncon based rules Enabled by default
Enable generation of ancestor based rules Disabled by default
Max number of items in each rule Number between 1 and 3, 2 by default
Max number of rules for each taxonomy category Number between 5 and 50, 20 by default
Min number of annotated documents for a category, to enable rules generation Number between 2 and 100, 5 by default
Max number of rules in which any single item can participate Number between 2 and 20, 5 by default
Max number of elements in a single item of a rule Number between 1 and 3, 2 by default

Note

If Enable generation of syncon based rules is disabled, Enable generation of ancestor based rules will not be available.

Fine tuning

Parameter Description
Desired Clean level 10 by default
Desired filter sequence 40, 80, 90 by default

Extraction

Explainable Extraction

Training docs

  • Select an annotated library to perform an experiment in Training library.

  • Select the Training documents selection policy among:

    • Only validated documents (strict)
    • Prefer validated documents (Selected by default)
    • Prefer annotated documents
    • Random selection

Support, confidences and tolerance parameters

Parameter Description
Max rule length Number between 1 and 10. 3 by default
Confidence threshold for accepting a rule Number between 0.001 and 1. 0.8 by default
Tolerance Number between 0.0001 and 0.1. 0.02 by default

Active feature options

Parameter Description
Left context window size Number between 1 and 10. 3 by default
Right context window size Number between 1 and 10. 3 by default
Use token raw form On by default
Use lemma On by default
Use POS type On by default
Use syncon On by default
Use ancestor On by default
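As an illustration of the left and right context window sizes, the sketch below collects the tokens surrounding a target token. It assumes that the window size counts whole tokens on each side of the annotated token, which the documentation does not state explicitly. The function name `context_window` is hypothetical.

```python
def context_window(tokens, index, left=3, right=3):
    """Return the tokens before and after tokens[index], limited by the window sizes."""
    start = max(0, index - left)  # do not run past the start of the text
    before = tokens[start:index]
    after = tokens[index + 1:index + 1 + right]
    return before, after


tokens = "The contract was signed by Acme Corp on Monday".split()
before, after = context_window(tokens, tokens.index("Acme"))
# before == ["was", "signed", "by"], after == ["Corp", "on", "Monday"]
```

Under this assumption, larger windows expose more surrounding tokens to rule generation, while smaller windows keep the context tighter around the annotation.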

Auto-ML Extraction

Training docs

  • Select an annotated library to perform an experiment in Training library.

  • Select the Training documents selection policy among:

    • Only validated documents (strict)
    • Prefer validated documents (Selected by default)
    • Prefer annotated documents
    • Random selection

Model type

Parameter Description
CRF model CRF entity extraction model: Conditional Random Fields probabilistic model designed for sequence labeling
SVM sliding window Support Vector Machine using a sequence tagging approach translated into a local linear SVC classifier

Note

If you select the SVM sliding window model, the Auto ML parameters, the Hyperparameters and the F-Beta parameters won't be available.

Auto-ML parameters

Parameter Description
Enable custom setup Disabled by default. Enable it to set the other parameters
Number of training iterations for the model Number between 20 and 100 (30 by default)
Number of data splits for cross-validation Number between 2 and 10 (3 by default)
Call back function for stopping the self-tuning process You can select:
- Stop based on best scoring evaluation
- Stop based on total time
- Stop based on both best scoring evaluation and total time (by default)
Target time (in minutes) deadline for the call back stop function Number between 15 and 120 (30 by default)

Note

If you select Stop based on best scoring evaluation, the Target time (in minutes) deadline for the call back stop function parameter won't be available.

Feature space

Parameter Description
Word base form (Lemma): Base form of a word (lemma) (e.g. “run” for “running” or “ran”) Set on Don't use by default
Logic dependencies: Relationships and dependencies of the word (e.g. “subject” – “relationship type” – “object” relationships) Set on Use by default
Word Part-of-Speech: Part-of-speech of a word (e.g. noun, verb, etc.) Set on Use by default
Collocations: Combination of words frequently used together which have a specific meaning (e.g. “regular exercise” or “to take a risk”) Set on Use by default
Phrases: Combination of words that together create a singular meaning (e.g. “to look after” or “on the table”) Set on Don't use by default
Syncons: Conceptual meaning of a word or phrase (e.g. “to work out” means “to exercise”) Set on Use by default
Known Concepts: Specific meaning of a term that is available in the Expert.ai Knowledge Graph (e.g. “Italy” is a specific country, “World Cup” is a specific football tournament) Set on Don't use by default
Entities: Entities (e.g. persons, organizations, etc.) Set on Use by default
Knowledge Graph relations: Attribute the hierarchical relation nodes as added meaning to the word (e.g. “dentist” is also “medical specialist”, “doctor”, “professional”, etc.) Set on Don't use by default
Title case words: Toggle title case as a feature in the word vector Set on Use by default
Upper case words: Toggle uppercase as a feature in the word vector Set on Use by default
Digit words: Toggle digit as a feature in the word vector Set on Use by default
Mixed case words: Toggle mixed case as a feature in the word vector Set on Use by default
Alpha Numeric words: Toggle alpha numeric as a feature in the word vector Set on Use by default
Alphabetic words: Toggle alphabetic as a feature in the word vector Set on Use by default
Numeric words: Toggle numeric as a feature in the word vector Set on Use by default
Decimal number words: Toggle decimal number as a feature in the word vector Set on Use by default
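The word-shape toggles at the end of the table can be pictured as boolean flags computed on each token. The definitions below are assumptions made for illustration only, since the documentation does not define each flag precisely. The "numeric" flag is left out because its boundary with "digit" is not specified. The function name `shape_features` is hypothetical.

```python
import re


def shape_features(token):
    """Compute assumed word-shape flags for a single token."""
    has_upper = any(c.isupper() for c in token)
    has_lower = any(c.islower() for c in token)
    has_alpha = any(c.isalpha() for c in token)
    has_digit = any(c.isdigit() for c in token)
    return {
        "title_case": token.istitle(),
        "upper_case": token.isupper(),
        # Both cases present, but not a plain Title-case word (e.g. "iPhone").
        "mixed_case": has_upper and has_lower and not token.istitle(),
        "alphabetic": token.isalpha(),
        # Letters and digits together, with no other characters (e.g. "A320").
        "alpha_numeric": token.isalnum() and has_alpha and has_digit,
        "digit": token.isdigit(),
        # Digits, a dot, then digits (e.g. "3.14").
        "decimal_number": re.fullmatch(r"\d+\.\d+", token) is not None,
    }
```

For example, "iPhone" is flagged as mixed case, "A320" as alpha numeric and "3.14" as a decimal number.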

Hyperparameters

Parameter Description
CRF c1 regularization coefficient The following values are set by default:
- 0
- 0.001
- 0.01
- 0.1
- 0.5
- 1
CRF c2 regularization coefficient The following values are set by default:
- 0.001
- 0.01
- 0.1
- 0.5
- 1
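To give a sense of the search space defined by the default values above, the sketch below enumerates every c1/c2 pair. Whether the engine evaluates all pairs or only a sample of them is not stated in this documentation, so the exhaustive enumeration is an assumption. The name `candidate_pairs` is hypothetical.

```python
from itertools import product

# Default candidate values listed in the table above.
C1_VALUES = [0, 0.001, 0.01, 0.1, 0.5, 1]
C2_VALUES = [0.001, 0.01, 0.1, 0.5, 1]


def candidate_pairs(c1_values=C1_VALUES, c2_values=C2_VALUES):
    """Return every combination of c1 and c2 as a list of dictionaries."""
    return [{"c1": c1, "c2": c2} for c1, c2 in product(c1_values, c2_values)]


pairs = candidate_pairs()
# 6 values for c1 times 5 values for c2 gives 30 combinations.
```

Removing values from either list shrinks the number of combinations accordingly.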

F-Beta

Parameter Description
Enable F Beta optimization (tuning balance between precision and recall) Disabled by default. Enable it to set the parameter below
Target F Beta: A beta value of 1.0 is the same as the standard F-measure (F1). A smaller beta value, such as 0.5, gives more weight to precision and less to recall. A larger beta value, such as 2.0, gives less weight to precision and more weight to recall in the calculation of the score. Number between 0 and 5 (1 by default)