Experiment parameters

The parameters that can be set in the categorization, extraction and thesaurus experiments are described below.

Training documents selection

The selection of the documents to use to train the model is the first step of all the experiments that generate models. Its parameters are:

Training documents selection policy

This parameter applies to all the categorization and extraction experiments that generate a model.
It determines which documents from the training set are used for training. Possible values are:
- Only validated annotated documents (strict): only annotated documents that have also been validated will be used to train the model.
- Only validated or annotated documents (strict): only documents that are annotated or validated will be used to train the model.
- Prefer validated documents: in case of sub-sampling, validated documents will be preferred over non-validated documents.
- Prefer annotated documents: in case of sub-sampling, annotated documents will be preferred over non-annotated documents.
- Random selection: the documents used to train the model will be randomly selected from the library.
Enable subsampling using random selection strategy

This parameter applies to ML categorization experiments and to Auto-ML extraction experiments.

When turned on, only a randomly selected subset of the training library is used to train the model. The Subsampling max documents parameter (see below) determines the size of the subset.
Subsampling max documents

This is a sub-parameter of Enable subsampling using random selection strategy (see above). It determines the size of the subset.
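
As a rough illustration of random subsampling capped at a maximum number of documents, here is a minimal sketch in Python. The function name `subsample` and the `seed` argument are hypothetical and are not Platform settings; the selection logic actually used by Platform may differ.

```python
import random

def subsample(documents, max_documents, seed=None):
    """Return at most max_documents items chosen uniformly at random."""
    documents = list(documents)
    if max_documents >= len(documents):
        # The library is already small enough: keep everything.
        return documents
    rng = random.Random(seed)  # seed is only here to make the sketch repeatable
    return rng.sample(documents, max_documents)

library = [f"doc_{i}" for i in range(10)]
subset = subsample(library, max_documents=4, seed=0)
print(len(subset))  # 4
```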

Training windows

The following parameters apply to ML extraction experiments and affect the areas of text around annotations (windows) that are considered in the training process.

Ignore non-annotated areas

When turned on, for non-validated documents, the only portions of text used to train the models are those around the annotations. The size of the area around annotations is determined by the Annotated area window size parameter (see below). For validated documents, instead, all the text is considered.
Annotated area window size

This is a sub-parameter of Ignore non-annotated areas (see above).

It's the size of the area around annotations to consider for training. It's expressed in sentences before and after the sentence containing the annotation, so for example value 2 means the area includes two sentences before and two sentences after.
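
The window described above can be sketched as follows. This is only an illustration: the function name `window_sentences` and the use of zero-based sentence indexes are assumptions, not part of Platform.

```python
def window_sentences(num_sentences, annotated_index, window_size):
    """Indexes of the sentences kept for training around one annotation.

    The window is clipped at the beginning and at the end of the document.
    """
    start = max(0, annotated_index - window_size)
    end = min(num_sentences - 1, annotated_index + window_size)
    return list(range(start, end + 1))

# A 10-sentence document, annotation in sentence 5, window size 2:
print(window_sentences(10, 5, 2))  # [3, 4, 5, 6, 7]
```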
Enable Negative Sub-sampling

It's an alternative to Ignore non-annotated areas (see above).

When turned on, the training algorithm smartly chooses some non-annotated areas around annotations and excludes them from training in order to reduce noise.

Problem definition

Problem definition is a step of ML categorization experiments. Its parameters are:

Enable strict "single label" mode

When turned on, the model predicts at most one category for each document. When off, the model can detect any number of categories.
Enable strict "Sub document categorization" compatibility mode

When turned on, the model can predict categories over portions of the document (sub document categorization).

Warning

Annotation and testing for sub document categorization must be performed with external tools. Contact your technical support for more information.

Feature space

Feature space parameters affect ML categorization and extraction experiments and determine which features of the text are used to train the model.

In Auto ML experiments, Platform can automatically decide the features to use. This behavior is activated when the Automatic features selection option is turned on.

The features that can be turned on or off are:

Feature	Description	Experiment type
Alpha Numeric words	Words consisting of both letters and digits	Extraction
Alphabetic words	Words consisting only of letters	Extraction
Collocations	Combined words, the combination has its own meaning (e.g. credit card or take a risk)	Extraction
Decimal number words	Words representing decimal numbers	Extraction
Digit words	Words consisting only of digits	Extraction
Entities	Named entities like people, places and organizations	Categorization and extraction
Knowledge Graph relations	Knowledge Graph ascending concepts, along ISA-type relationships, of the concept corresponding to text words (e.g. dentist is a medical specialist, which is a doctor, which is a professional)	Categorization and extraction
Knowledge Label	Main lemma of the concept which, inside the Knowledge Graph, in an ISA relation, is the parent of the concept corresponding to the text word (e.g. if the text word is moratorium, its parent concept's label is legal action )	Categorization
Known Concepts	Knowledge Graph concepts (syncons) for text words that are well known proper nouns (e.g. World Cup, United States)	Extraction
Logic dependencies	Syntactic relationships between text words (e.g. subject-verb-object)	Extraction
Main lemma	Base forms (lemmas) of the most important words of the text	Categorization
Main Syncons	Knowledge Graph concepts (syncons) of the most important words of the text	Categorization
Main Topics	Knowledge Graph topics the text is primarily about	Categorization
Mixed case words	Words consisting of both uppercase and lowercase letters	Extraction
Numeric words	Words that represent numbers	Extraction
Phrases	Phrases, i.e. one or more words that form a meaningful grammatical unit	Extraction
Sub-words	Parts of a word like morphemes, stems and roots	Categorization
Syncon Topics	The topics that in the Knowledge Graph are attributed to the concepts (syncon) corresponding to the text words	Categorization
Syncons	Knowledge Graph concepts (syncon) corresponding to text words	Categorization and extraction
Title case words	Capitalized words	Extraction
Upper case word	Words consisting only of uppercase letters	Extraction
Use word embeddings	Static word embeddings	Categorization
Word base form (Lemma)	Base form (lemma) of text words (e.g. run for running and ran)	Categorization and extraction
Word base form stem	Stem of text words (e.g. intern for international)	Categorization
Word form	Text word exactly as written in the text	Categorization
Word Part-of-Speech	Part-of-speech of a word (e.g. noun, verb, adjective)	Extraction

Other feature space parameters are:

For both categorization and extraction experiments:
- Max features: the maximum number of text features to use to train the model. Value 0 means all available features.
For categorization experiments:
- Maximum N for N-grams: the N to compute N-grams for stem and keywords features. All the N-Grams up to N will be computed, thus value 1 means that only Unigrams will be computed while with 3 Unigrams, Bigrams and Trigrams will be used.
- Min DF: the minimum number of documents in which a feature must appear to be considered for training.
- Max DF: the maximum percentage of documents in which a feature can appear to be considered for training.
For extraction experiments:
- Min WF: the minimum number of windows (areas around annotations) per document in which a feature must appear in order to be considered for training.
- Max WF: the maximum percentage of windows per document in which a feature can appear to be considered for training.
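
To make the N-gram and document frequency parameters more concrete, here is a minimal sketch. The function names are hypothetical, and Max DF is expressed as a fraction between 0 and 1 rather than a percentage, which is an assumption made for simplicity. The same filtering idea would apply to Min WF and Max WF, with windows in place of documents.

```python
def ngrams_up_to(tokens, max_n):
    """All N-grams from unigrams up to max_n, as space-joined strings."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def filter_by_df(docs_features, min_df, max_df):
    """Keep features found in at least min_df documents and in at most
    a max_df fraction of all documents."""
    total = len(docs_features)
    counts = {}
    for features in docs_features:
        for feature in set(features):  # count each feature once per document
            counts[feature] = counts.get(feature, 0) + 1
    return {f for f, c in counts.items() if c >= min_df and c / total <= max_df}

print(ngrams_up_to(["credit", "card", "fee"], 2))
# ['credit', 'card', 'fee', 'credit card', 'card fee']

docs = [["a", "b"], ["a", "c"], ["a", "b"]]
print(filter_by_df(docs, min_df=2, max_df=0.8))  # {'b'}
```

In the second example, `a` is dropped because it appears in every document and `c` because it appears in only one.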

Hyperparameters

Hyperparameters apply to ML categorization and extraction experiments.

Alpha regularizer

Applies to these categorization model types:
- Multinomial Naive Bayes
- Complement Naive Bayes
Regularization parameter, smoothing factor on term counts. Large values increase the regularization.
C parameter: penalty for misclassifications

Applies to these categorization model types:
- Linear SVM
- Passive Aggressive
- Probabilistic SVM
The C parameter is a regularization (or generalization) parameter for training data.
It is a number and it is used to prevent over-fitting and under-fitting.
If you have a big training set and you consider it representative, the parameter value should be small to increase the regularization and force the model to be tailored on training data. On the contrary, if you have a small training set, which may not be representative, it is better to select a large value to avoid the possible errors related to a strong regularization.
Class weight

Applies to these categorization model types:
- Linear SVM
- Probabilistic SVM
- Custom kernel SVM
- SGD
- Random Forest
- Logistic Regression
Regularization parameter to balance categories. Possible values are:
- Balanced
- None
If one category is preponderant over all the others in the training set, balancing categories prevents unbalanced predictions for less represented classes. If the training set is highly representative, balancing makes the model perform slightly worse. If, on the other hand, the training set is not very representative and balancing is not enabled, model performance is poor.
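
One common way to compute balanced weights is the heuristic used by scikit-learn for its "balanced" class weight: each category gets a weight inversely proportional to its frequency. Whether Platform uses exactly this formula is an assumption; the sketch below only illustrates the idea.

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight = n_samples / (n_classes * samples_in_class)"""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

# Hypothetical training set: one category is four times larger than the other.
labels = ["sports"] * 8 + ["finance"] * 2
print(balanced_class_weights(labels))  # {'sports': 0.625, 'finance': 2.5}
```

The less represented category receives the larger weight, so its errors count more during training.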
CRF c1 regularization coefficient

Applies to the CRF extraction model type.

Regularization coefficient. If set to a value greater than 0, it enables the Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) method and is used as the L1 (alpha) regularization value.
CRF c2 regularization coefficient

Applies to the CRF extraction model type.

Regularization coefficient. If a value is set, it enables the L2 (lambda) regularization value of the l2sgd and lbfgs training algorithms.
CRF Forced use of all possible states

Applies to the CRF extraction model type.
When enabled, the algorithm generates state features for all the possible combinations of attributes and labels, including those that don't occur in the training data (negative state features). This may improve labeling accuracy but slows down the training process.
CRF Forced use of all possible transitions

Applies to the CRF extraction model type.
When enabled, the algorithm generates transition features for all the possible pairs of labels, even if they don't occur in training data (negative transition features).
Custom kernel type to be applied

Applies to the Custom kernel SVM categorization model type.
It's the kernel function to use to represent features.
Degree of polynomial for polynomial kernel

Applies to the Custom Kernel SVM categorization model type.
Affects the polynomial custom kernel. It's the degree of the polynomial function.
Fit batch size

Applies to online training extraction model types.
It's the size of the batches in which the training set gets divided.
Inverse of regularization strength

Applies to these categorization model types:
- Custom kernel SVM
- Logistic Regression
Penalty for misclassification. It's a regularization parameter for training data. It is a number and it is used to prevent over-fitting and under-fitting. If you have a big training set and you consider it representative, the value should be small to increase the regularization and force the model to be tailored on training data. On the contrary, if you have a small dataset, which may not be representative, it is better to select large values to avoid errors.
L1 regularization term on weights

Applies to the XGBoost categorization model type.

Regularization alpha parameter. Lasso regularization parameter, used to reduce complexity and prevent over-fitting. When 0, no regularization is applied. The cost function is altered by adding a penalty term equal to the magnitude of the coefficients.
L2 regularization term on weights

Applies to the XGBoost categorization model type.

Regularization lambda parameter. Ridge regularization parameter, used to reduce complexity and prevent over-fitting. The cost function is altered by adding a penalty term equal to the squared magnitude of the coefficients. The penalty term lambda regularizes large coefficients by penalizing the cost function. When 0, no regularization is applied.
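
The difference between the L1 and L2 penalty terms can be sketched generically as follows. The function names are hypothetical and the exact scaling of each term differs between libraries, so the numbers are only indicative.

```python
def l1_penalty(weights, alpha):
    """Lasso term: alpha times the sum of absolute coefficient values."""
    return alpha * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """Ridge term: lambda times the sum of squared coefficient values."""
    return lam * sum(w * w for w in weights)

weights = [0.5, -2.0, 1.0]
print(l1_penalty(weights, alpha=0.1))
print(l2_penalty(weights, lam=0.1))
# With alpha or lambda set to 0, the penalty vanishes (no regularization).
print(l1_penalty(weights, alpha=0.0), l2_penalty(weights, lam=0.0))
```

Because the L2 term squares each coefficient, the large coefficient (-2.0) dominates it much more than it dominates the L1 term.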
Learning rate

Applies to these categorization model types:
- GBoost
- XGBoost
The rate of adaptation of the model to learning, that is how quickly the error tends to decrease.
The value of this parameter scales the contribution of each decision tree. Raising it means faster training but possibly less effective models.
Left window size, CRF left window size

Applies to these extraction model types:
- Passive aggressive
- SVM sliding window
- SGD sliding window
- CRF
Number of tokens to the left of an annotation that the algorithm takes into account.
Max number of epochs

In online training experiments, it's the maximum number of training epochs, that is the number of times the ML algorithm passes through the entire training set.
Training can stop before this number of iterations based on the value of the Patience parameter.
N. of iterations with no change

Applies to the GBoost categorization model type.

Parameter used as early stopping criterion. During training, if the score hasn't improved since the last iteration, the training stops. Value -1 means no improvement.
Normalize: penalize long documents to avoid their dominance in stats

Applies to the Complement Naive Bayes categorization model type.
When turned on, long documents are discarded to balance statistics.
Number of trees, Number of decision trees

Applies to these categorization model types:
- GBoost
- XGBoost
- Random Forest
Number of decision trees to generate to make a decision. This value impacts on the data regularization. A high value implies a low regularization (or generalization) and a possible over-fitting. On the other hand a low value implies a high level of generalization and this involves the risk of not predicting some classes and under-fitting.
Optimization problem algorithm

Applies to the Logistic Regression categorization model type.
It's the algorithm to use for the optimization problem.
Patience

Applies to online training experiments.
It's the maximum number of epochs without improvement that are tolerated before stopping iterations.
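
The interplay between Max number of epochs and Patience can be sketched as a simple loop. This is an interpretation, not Platform's implementation: it assumes that non-improving epochs are counted consecutively and that training stops as soon as their number exceeds the patience value. The per-epoch scores are made-up numbers.

```python
def train_with_patience(scores_per_epoch, max_epochs, patience):
    """Return (epochs actually run, best score seen)."""
    best = float("-inf")
    epochs_without_improvement = 0
    epochs_run = 0
    for score in scores_per_epoch[:max_epochs]:
        epochs_run += 1
        if score > best:
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement > patience:
                break  # patience exhausted: stop before max_epochs
    return epochs_run, best

scores = [0.5, 0.6, 0.6, 0.59, 0.58, 0.7]
print(train_with_patience(scores, max_epochs=10, patience=2))  # (5, 0.6)
print(train_with_patience(scores, max_epochs=10, patience=5))  # (6, 0.7)
```

With a patience of 2 the loop stops at the fifth epoch and never sees the later improvement, which shows the trade-off between training time and final score.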
Right window size, CRF right window size

Applies to these extraction model types:
- Passive aggressive
- SVM sliding window
- SGD sliding window
- CRF
Number of tokens to the right of an annotation that the algorithm takes into account.
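
As an illustration of the left and right window sizes, the following sketch returns the context tokens around an annotation. The function name and the representation of an annotation as a pair of token indexes (start inclusive, end exclusive) are assumptions.

```python
def context_window(tokens, start, end, left, right):
    """Tokens to the left and to the right of the annotation tokens[start:end]."""
    left_context = tokens[max(0, start - left):start]
    right_context = tokens[end:end + right]
    return left_context, right_context

tokens = "The invoice total is 500 euros due Friday".split()
# The annotation covers "500 euros" (tokens 4 and 5).
print(context_window(tokens, 4, 6, left=2, right=1))
# (['total', 'is'], ['due'])
```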
SGD alpha regularization parameter

Applies to the SGD categorization model type.
Regularization parameter for training data. It's a number and it is used to prevent over-fitting and under-fitting. Larger values set a stronger regularization.
Split criterion on tree nodes

Applies to the Random Forest categorization model type.

The split criterion to be applied on each node of the tree, that is the function used to measure the quality of a candidate split.
Possible values:
- Gini impurity
- Entropy
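
Both criteria measure how mixed the categories are in a node: the lower the value, the purer the node. A minimal sketch of the two standard formulas follows; the function names are illustrative.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits of the class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

mixed = ["a", "a", "b", "b"]
pure = ["a", "a", "a", "a"]
print(gini(mixed), entropy(mixed))  # 0.5 1.0
print(gini(pure) == 0, entropy(pure) == 0)  # True True
```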
Stop condition tolerance, Tolerance for early stopping

Applies to these categorization model types:
- Custom kernel SVM
- GBoost
- Logistic Regression
It's a number indicating how much error is tolerated before early stopping. Larger values mean fewer iterations.

Generic categorization parameters

Generic parameters apply to explainable categorization experiments.

Enable "onCategorizer" optimization

When enabled, categorization fine tuning is performed.
Enable "strict" hierarchical mode

When enabled, all ascending categories in the hierarchy are returned together with the predicted category.
For example, if the taxonomy models animals and the predicted category is cat then the entire hierarchical path cat > feline > mammals > vertebrates > animals is returned.
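
The example above can be reproduced with a small sketch that walks up the taxonomy from the predicted category. The `PARENT` mapping and the function name are hypothetical and only model the example taxonomy.

```python
# Child -> parent links of the example taxonomy.
PARENT = {
    "cat": "feline",
    "feline": "mammals",
    "mammals": "vertebrates",
    "vertebrates": "animals",
}

def with_ancestors(category, parent_of):
    """Return the category followed by all its ancestors up to the root."""
    path = [category]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return path

print(" > ".join(with_ancestors("cat", PARENT)))
# cat > feline > mammals > vertebrates > animals
```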
Enable "single label" mode

If enabled, the model predicts at most one category.

Rules generation

Rules generation parameters apply to these types of experiment:

Explainable categorization
Bootstrapped Studio project (categorization)
Explainable extraction
Thesaurus generation

Categorization experiments

Enable generation of syncon based rules

When enabled, generated rules can use the SYNCON attribute. If disabled, the Enable generation of ancestor based rules parameter is also disabled.
Enable generation of ancestor based rules

When enabled, generated rules can use the ANCESTOR attribute. This parameter is disabled if the Enable generation of syncon based rules parameter is disabled.
Max number of items in each rule

Maximum number of operands in rules' conditions.
Max number of rules for each taxonomy category

Maximum number of rules that can be generated for each category.
Min number of annotated documents for a category, to enable rules generation

Minimum number of documents in which a category has been annotated that is required to generate rules for that category.
Max number of rules in which any single item can participate

Maximum number of rules in which a text feature (e.g. a concept, a lemma, an exact word) can be used. It's used to limit the excessive generation of rules.
Max number of elements in a single item of a rule

Maximum number of attributes that can be used in an operand of a rule condition.

Extraction experiments

Maximum number of conditions for any given rule

Maximum number of conditions to use in a rule.
Enable automatic minimum support setup

When turned on, Platform will automatically determine the minimum support, that is the minimum number of times a rule must match inside the training set to be included in the model.
This parameter is an alternative to Custom minimum support threshold (see below).
Custom minimum support threshold

Manually entered alternative to Enable automatic minimum support setup (see above).
Enable automatic minimum confidence setup

When turned on, Platform will automatically determine the minimum confidence, that is the number of times a rule must match in the class target context to be included in the model.
This parameter is an alternative to Custom minimum confidence to explore a rule (see below).
Custom minimum confidence to explore a rule

Manually entered alternative to Enable automatic minimum confidence setup (see above).
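
To give an intuition of support and confidence, the sketch below uses the usual rule-mining reading: support is the number of times a rule matches in the training set, and confidence is the share of those matches that fall on the target class. Treating confidence as a ratio is an assumption; Platform may compute it differently. The function name and the sample data are made up.

```python
def support_and_confidence(matches):
    """matches: one boolean per rule match, True when the match falls
    on the target class."""
    support = len(matches)
    confidence = sum(matches) / support if support else 0.0
    return support, confidence

# A rule that matched four times, three of them on the target class.
print(support_and_confidence([True, True, False, True]))  # (4, 0.75)
```

A rule would then be kept only if both values reach the minimum thresholds, whether set automatically or entered manually.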
Minimum acceptance confidence threshold

Minimum threshold to determine that a rule is acceptable. Smaller values mean greater acceptance and imply a greater final project recall.
Minimum confidence improvement for adding a new condition to a rule

Minimum improvement of rule's confidence that an additional condition must bring in order to be included in the rule.
Enable concatenation of contiguous extractions

When turned on, contiguous extractions—composed of multiple adjacent tokens—are concatenated.
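
A possible reading of this behavior is sketched below: extractions of the same class found on adjacent tokens are merged into one. The data layout (label, token index, text) and the requirement that the labels match are assumptions made for the example.

```python
def concatenate_contiguous(extractions):
    """extractions: list of (label, token_index, text), sorted by token_index."""
    merged = []
    for label, index, text in extractions:
        last = merged[-1] if merged else None
        if last and last["label"] == label and last["end"] + 1 == index:
            # Adjacent token with the same label: extend the previous extraction.
            last["end"] = index
            last["text"] += " " + text
        else:
            merged.append({"label": label, "start": index, "end": index, "text": text})
    return merged

found = [("NAME", 3, "John"), ("NAME", 4, "Smith"), ("NAME", 9, "Mary")]
for item in concatenate_contiguous(found):
    print(item["label"], item["text"])
# NAME John Smith
# NAME Mary
```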

Thesaurus experiments

Template name

Template name for output records.
Field name

Name of the field where the concepts are extracted.
Use BLEMMA

When turned on, BLEMMA rules can be generated.
File/batch granularity

Maximum number of concepts whose extraction rules are placed in a single rule file.

Extraction feature options

Feature options apply to explainable extraction experiments.

Window size (in tokens) to the left of the token being predicted

A window is a segment (set of tokens) used to find regularities in the text in order to generate rules. This parameter specifies the number of tokens to consider to the left of the predicted token, that is the tokens that precede it.
Window size (in tokens) to the right of the token being predicted

This parameter specifies the number of tokens to consider to the right of the predicted token, that is the tokens that follow it.
Minimum document frequency

Minimum number of documents in which a feature (e.g. a concept, a lemma, an exact word) must be present in order to include it in a rule.
Raw word form

When enabled, exact words can be used in rules to match text words.
Word base form (Lemma)

When enabled, base forms (lemmas) can be used in rules to match the corresponding attribute of text words.
Word Part-of-Speech

When enabled, part-of-speech (for example noun, verb, etc.) can be used in rules to match the corresponding attribute of text words.
Syncons

When enabled, Knowledge Graph concepts (syncons) can be used in rules to match the concept expressed by text words.
Ancestors

When enabled, Knowledge Graph concepts (syncons) that, in an ISA hierarchy, are the ascendant of a given concept can be used in rules to match concepts expressed by text words.
Numeric words

When enabled, numeric words (like 500) can be used in rules to match text words.
Use suffix of a word

When enabled, word suffixes can be used in rules to match text words suffixes.
Use prefix of a word

When enabled, word prefixes—also called stems or roots—can be used in rules to match text words prefixes.

Categorization fine tuning

Explainable categorization and bootstrapped Studio project experiments create models that can use JavaScript to extend and control the document analysis pipeline.
Fine tuning is performed with the onCategorizer event handling function, which is automatically invoked after categorization rules have been evaluated.

Fine tuning is performed and can be configured only if the Enable "onCategorizer" optimization parameter in the Generic parameters step of the experiment wizard is turned on.

Desired clean level

Categorization results cleanup is performed with the CLEAN function.
The value of this parameter affects the value argument of that function: if set to auto in explainable categorization experiments, the fine tuning algorithm iteratively guesses the best value to use starting from the value of the Default clean level parameter (see below).
Default clean level

Doesn't apply to bootstrapped Studio project experiments.
Initial value for the value argument of the CLEAN function when Desired clean level (see above) is set to auto.

Note

When the Enable "single label" mode parameter in the Generic parameters step of the experiment wizard is turned on, this parameter is ignored.
Also, if Enable conservative clean (see below) is enabled, cleanup is skipped.
Desired filter sequence

The value to use for the filters argument of the FILTER function. It can be set to auto in explainable categorization experiments which means that the fine tuning algorithm iteratively guesses the best value starting from the value of the Default filter sequence parameter (see below).
Default filter sequence

Doesn't apply to bootstrapped Studio project experiments.
Initial value for the filters argument of the FILTER function when Desired filter sequence (see above) is set to auto.

Note

When the Enable "single label" mode parameter in the Generic parameters step of the experiment wizard is turned on, this parameter is ignored and value 100 is used to keep only the category with the highest score.
Enable conservative clean

Doesn't apply to bootstrapped Studio project experiments.
If no category exceeds the clean level (see above), cleanup is not performed.
Max number of documents to be considered by the optimization algorithm

Doesn't apply to bootstrapped Studio project experiments.
Maximum number of documents used in the fine tuning process. Value -1 means no limit.

Extraction rules selection

Rules selection parameters apply to explainable extraction experiments. They affect the fine tuning of generated rules.
Rules are fine tuned by validating them against a subset of the training set, ranking them and selecting those with the highest scores.

Fine-tuning rules selecting only the most significant ones

When turned on, the rule validation and selection step is performed and all the other fine tuning parameters (see below) can be set.
Number of rules selection steps

Number of iterations in the rules validation and selection step.
Fraction of validation split

Percentage of the training set that is used by the validation and selection step.
Activate rules pruning

When enabled, the number of selected rules to keep and include in the model can be set with the Max number of rules to select parameter (see below).
Max number of rules to select

Maximum number of rules to keep after validation and selection. Rules are counted by scrolling through the list of selected rules in descending score order. This parameter can be set only if Activate rules pruning is enabled (see above).
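
Pruning can be pictured as keeping the top of a ranked list. In this sketch the rule names and scores are invented, and `prune_rules` is a hypothetical helper.

```python
def prune_rules(scored_rules, max_rules):
    """Keep the max_rules best rules, in descending score order."""
    ranked = sorted(scored_rules, key=lambda item: item[1], reverse=True)
    return ranked[:max_rules]

rules = [("rule_1", 0.4), ("rule_2", 0.9), ("rule_3", 0.7)]
print(prune_rules(rules, max_rules=2))
# [('rule_2', 0.9), ('rule_3', 0.7)]
```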

F-Beta

F-Beta parameters apply to all experiments.

F-Beta is a more general way of computing the F-score. F-Beta parameters affect the balance between precision and recall when computing F-Measure at the end of the test phase of the experiment.

F-Beta parameters are:

Enable F-Beta optimization (tuning balance between precision and recall): when turned on, it is possible to set the Target F-Beta parameter.
Target F-Beta: value 1 gives the same weight to precision and recall, values lower than 1 give more weight to precision while values greater than 1 give more weight to recall.
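
The effect of the Target F-Beta value can be checked with the standard F-beta formula, (1 + beta²) · P · R / (beta² · P + R). The function below is a plain implementation of that formula; the precision and recall figures are made up.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A model with low precision and perfect recall.
p, r = 0.5, 1.0
for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_beta(p, r, beta), 3))
# 0.5 0.556
# 1.0 0.667
# 2.0 0.833
```

Since recall is the stronger metric in this example, the score grows as beta increases and gives more weight to recall.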

Auto ML parameters

These parameters apply to Auto-ML experiments when you turn on Automatic features selection or Activate Auto-ML on every parameter.
In that case, Platform trains an ML model that it then uses to predict the best features and the best hyperparameter values to use when actually training the experiment model.

This assistant model is trained iteratively by passing through its training data multiple times. Its parameters are:

Number of training iterations for the AutoML algorithm: maximum number of self-tuning iterations.
Number of data splits for cross-validation of AutoML algorithm: number of subdivisions of training data.
Call back function for stopping the AutoML self-tuning process: early iteration termination policy. The stop can occur when a high score is reached, when a time limit is exceeded or when a combination of a good score and elapsed time occurs.
Target time deadline for the AutoML call back stop function (minutes): time limit beyond which the self-tuning algorithm stops iterating if a time-based early termination policy has been chosen (see the parameter above).

Layout information

The Analysis strategy for documents with layout information parameter applies to Studio experiments.

It determines how to manage graphical layout information that can be present in test documents. Test documents can have layout information if they originate from PDF files that were imported with the PDF document view option enabled.

Possible values are:

Require layout information: only documents with layout information are analyzed.
Rely on expert.ai Extract layout information where available: all documents are analyzed. The model will leverage layout information—when present—if it has been programmed to do so.
Plain text: all documents are analyzed and layout information is ignored, only the plain text is fed to the model.

Matching strategy

The matching strategy parameter applies to extraction experiments.

The value of the parameter (strict, ignore value or ignore position) determines the strategy used to compute experiment metrics.

Labels

These parameters apply to thesaurus generation experiments.

Consider labels coming from linked public sources

When turned on, labels deriving from linked public sources are considered when generating concept extraction rules.
Consider labels coming from linked projects

When turned on, labels deriving from linked internal sources are considered when generating concept extraction rules.