Experiment parameters
The parameters that can be set in the categorization, extraction and thesaurus experiments are described below.
Training documents selection
The selection of the documents to use to train the model is the first step of all the experiments that generate models. Its parameters are:
-
Training documents selection policy
This parameter applies to all the categorization and extraction experiments that generate a model.
It determines which documents from the training set are used for training. Possible values are:
- Only validated annotated documents (strict): only annotated documents that have also been validated will be used to train the model.
- Only validated or annotated documents (strict): only documents that are annotated or validated will be used to train the model.
- Prefer validated documents: in case of sub-sampling, validated documents will be preferred over non-validated documents.
- Prefer annotated documents: in case of sub-sampling, annotated documents will be preferred over non-annotated documents.
- Random selection: the documents used to train the model will be randomly selected from the library.
-
Enable subsampling using random selection strategy
This parameter applies to ML categorization experiments and to Auto-ML extraction experiments.
When turned on, only a randomly selected subset of the training library is used to train the model. The Subsampling max documents parameter (see below) determines the size of the subset.
-
Subsampling max documents
This is a sub-parameter of Enable subsampling using random selection strategy (see above). It determines the size of the subset.
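As a rough illustration of how a "prefer validated" policy could interact with subsampling, the sketch below picks at most a fixed number of documents and puts validated ones first. The `Doc` type, the `subsample` function and the shuffle-then-sort approach are assumptions made for this example, not Platform's actual implementation.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    # Hypothetical document record; field names are illustrative only.
    name: str
    validated: bool = False


def subsample(docs, max_docs, prefer_validated=True, seed=0):
    """Pick at most max_docs documents, optionally preferring validated ones."""
    rng = random.Random(seed)
    shuffled = list(docs)
    rng.shuffle(shuffled)  # random order within each preference group
    if prefer_validated:
        # Stable sort: validated documents come first, random order is kept otherwise.
        shuffled.sort(key=lambda d: not d.validated)
    return shuffled[:max_docs]


# 20 documents, 5 of which are validated.
library = [Doc(f"doc{i}", validated=(i % 4 == 0)) for i in range(20)]
subset = subsample(library, max_docs=8)
```

With `prefer_validated=False` the same function behaves like plain random selection.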
Training windows
The following parameters apply to ML extraction experiments and affect the areas of text around annotations (windows) that are considered in the training process.
-
Ignore non-annotated areas
When turned on, for non-validated documents, the only portions of text used to train the models are those around the annotations. The size of the area around annotations is determined by the Annotated area window size parameter (see below). For validated documents, instead, all the text is considered.
-
Annotated area window size
This is a sub-parameter of Ignore non-annotated areas (see above).
It's the size of the area around annotations to consider for training. It's expressed in sentences before and after the sentence containing the annotation, so for example value 2 means the area includes two sentences before and two sentences after.
-
Enable Negative Sub-sampling
It's an alternative to Ignore non-annotated areas (see above).
When turned on, the training algorithm smartly chooses some non-annotated areas around annotations and excludes them from training in order to reduce noise.
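The Annotated area window size described above can be pictured with a small sketch. It assumes the document is already split into sentences and that the window is simply clipped at the document boundaries; the function name and that clipping behavior are illustrative assumptions.

```python
def annotated_window(sentences, annotated_index, window_size):
    """Return the sentences kept for training around one annotated sentence."""
    start = max(0, annotated_index - window_size)
    end = min(len(sentences), annotated_index + window_size + 1)
    return sentences[start:end]


sentences = ["S0", "S1", "S2", "S3", "S4", "S5", "S6"]
window = annotated_window(sentences, annotated_index=3, window_size=2)
print(window)  # ['S1', 'S2', 'S3', 'S4', 'S5']
```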
Problem definition
Problem definition is a step of ML categorization experiments. Its parameters are:
-
Enable strict "single label" mode
When turned on, the model predicts at most one category for each document. When off, the model can detect any number of categories.
-
(only for Auto-ML Categorization) Enable strict "Sub document categorization" compatibility mode
When turned on, the generated model can predict categories for sub-documents, that is, portions of the input document.
The Platform authoring application does not handle the concept of sub-document, so during an experiment the model is trained with entire documents, annotated with the expected categories, and predicts categories for entire test library documents. For this reason, when you have a sub-document categorization use case, the original documents must be broken into sub-documents using an external tool, and the sub-documents must then be imported as "normal" documents in the training and test libraries. This way you train and test the model on documents that are, in fact, chunks of larger source documents.
Once published and inserted in a workflow, however, the model can effectively manage sub-documents provided that:
- The input to the model block contains the text of the whole document plus the information (boundaries) that identifies sub-documents inside it.
- The type of boundaries is specified as a configuration parameter of the model block.
Under these conditions the model makes predictions for sub-documents, and each output category is accompanied by the boundaries of the portion of the input document that identifies the sub-document the category refers to.
Feature space
Feature space parameters affect ML categorization and extraction experiments and determine which features of the text are used to train the model.
In Auto ML experiments, Platform can automatically decide the features to use. This behavior is activated when the Automatic features selection option is turned on.
Available features that can be used or not are:
Feature | Description | Experiment type |
---|---|---|
Alpha Numeric words | Words consisting of both letters and digits | Extraction |
Alphabetic words | Words consisting only of letters | Extraction |
Collocations | Combined words, the combination has its own meaning (e.g. credit card or take a risk) | Extraction |
Decimal number words | Words representing decimal numbers | Extraction |
Digit words | Words consisting only of digits | Extraction |
Entities | Named entities like people, places and organizations | Categorization and extraction |
Knowledge Graph relations | Knowledge Graph ascending concepts, along ISA-type relationships, of the concept corresponding to text words (e.g. dentist is a medical specialist, which is a doctor, which is a professional) | Categorization and extraction |
Knowledge Label | Main lemma of the concept which, inside the Knowledge Graph, in an ISA relation, is the parent of the concept corresponding to the text word (e.g. if the text word is moratorium, its parent concept's label is legal action ) | Categorization |
Known Concepts | Knowledge Graph concepts (syncons) for text words that are well known proper nouns (e.g. World Cup, United States) | Extraction |
Logic dependencies | Syntactic relationships between text words (e.g. subject-verb-object) | Extraction |
Main lemma | Base forms (lemmas) of the most important words of the text | Categorization |
Main Syncons | Knowledge Graph concepts (syncons) of the most important words of the text | Categorization |
Main Topics | Knowledge Graph topics the text is primarily about | Categorization |
Mixed case words | Words consisting of both uppercase and lowercase letters | Extraction |
Numeric words | Words that represent numbers | Extraction |
Phrases | Phrases, i.e. one or more words that form a meaningful grammatical unit | Extraction |
Sub-words | Parts of a word like morphemes, stems and roots | Categorization |
Syncon Topics | The topics that in the Knowledge Graph are attributed to the concepts (syncon) corresponding to the text words | Categorization |
Syncons | Knowledge Graph concepts (syncon) corresponding to text words | Categorization and extraction |
Title case words | Capitalized words | Extraction |
Upper case word | Words consisting only of uppercase letters | Extraction |
Use word embeddings | Static word embeddings | Categorization |
Word base form (Lemma) | Base form (lemma) of text words (e.g. run for running and ran) | Categorization and extraction |
Word base form stem | Stem of text words (e.g. intern for international) | Categorization |
Word form | Text word exactly as written in the text | Categorization |
Word Part-of-Speech | Part-of-speech of a word (e.g. noun, verb, adjective) | Extraction |
Other feature space parameters are:
-
For both categorization and extraction experiments:
- Max features: the maximum number of text features to use to train the model. Value 0 means all available features.
-
For categorization experiments:
- Maximum N for N-grams: the largest N used when computing N-grams for stem and keyword features. All N-grams up to N are computed: value 1 means only unigrams, while value 3 means unigrams, bigrams and trigrams.
- Min DF: the minimum number of documents in which a feature must appear to be considered for training.
- Max DF: the maximum percentage of documents in which a feature can appear to be considered for training.
-
For extraction experiments:
- Min WF: the minimum number of windows (areas around annotations) per document in which a feature must appear in order to be considered for training.
- Max WF: the maximum percentage of windows per document in which a feature can appear to be considered for training.
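The categorization parameters above (Maximum N for N-grams, Min DF, Max DF and Max features) can be sketched in a few lines. The function names are hypothetical, Max DF is expressed here as a fraction rather than a percentage, and truncating to the most frequent features is an assumption; how Platform actually ranks features is not stated here.

```python
from collections import Counter


def ngrams(tokens, max_n):
    """Return all n-grams of the token list for n = 1..max_n."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def select_features(docs, max_n=2, min_df=1, max_df=1.0, max_features=0):
    """Keep n-grams whose document frequency lies within [min_df, max_df]."""
    df = Counter()
    for tokens in docs:
        df.update(set(ngrams(tokens, max_n)))  # count each feature once per document
    kept = [
        f for f, count in df.items()
        if count >= min_df and count / len(docs) <= max_df
    ]
    # Most frequent first; ties broken alphabetically for determinism.
    kept.sort(key=lambda f: (-df[f], f))
    return kept if max_features == 0 else kept[:max_features]


docs = [
    ["credit", "card", "fraud"],
    ["credit", "card", "limit"],
    ["credit", "score"],
]
features = select_features(docs, max_n=2, min_df=2, max_df=0.9)
print(features)  # ['card', 'credit card']
```

Here `credit` is dropped because it appears in every document (above Max DF), while features that appear in only one document fall below Min DF. Min WF and Max WF follow the same idea, counting windows instead of documents.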
Hyperparameters
Hyperparameters apply to ML categorization and extraction experiments.
-
Alpha regularizer
Applies to these categorization model types:
Regularization parameter, smoothing factor on term counts. Large values increase the regularization.
-
C parameter: penalty for misclassifications
Applies to these categorization model types:
The C parameter is a regularization (or generalization) parameter for training data.
It is a number and it is used to prevent over-fitting and under-fitting.
If you have a big training set and you consider it representative, the parameter value should be small to increase the regularization and force the model to be tailored on training data. On the contrary, if you have a small training set, which may not be representative, it is better to select a large value to avoid the possible errors related to a strong regularization.
-
Class weight
Applies to these categorization model types:
Regularization parameter to balance categories. Possible values are:
- Balanced
- None
If one category is preponderant over all the others in the training set, balancing categories prevents unbalanced predictions for less represented classes. If the training set is highly representative, balancing makes the model perform slightly worse. If, on the other hand, the training set is not very representative and balancing is not enabled, model performance is poor.
-
CRF c1 regularization coefficient
Applies to the CRF extraction model type.
Regularization coefficient. If set to a value greater than 0, it enables the Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) method and is used as the L1 (alpha) regularization value.
-
CRF c2 regularization coefficient
Applies to the CRF extraction model type.
Regularization coefficient. If a value is set, it is used as the L2 (lambda) regularization value of the l2sgd and lbfgs training algorithms.
-
CRF Forced use of all possible states
Applies to the CRF extraction model type.
When enabled, the algorithm generates state features for all the combinations of attributes and labels, including those that don't occur in the training data (negative state features). This may improve labeling accuracy but slows down the training process.
-
CRF Forced use of all possible transitions
Applies to the CRF extraction model type.
When enabled, the algorithm generates transition features for all the possible pairs of labels, even if they don't occur in training data (negative transition features).
-
Custom kernel type to be applied
Applies to the Custom kernel SVM categorization model type.
It's the kernel function to use to represent features.
-
Degree of polynomial for polynomial kernel
Applies to the Custom Kernel SVM categorization model type.
Affects the polynomial custom kernel. It's the degree of the polynomial function.
-
Fit batch size
Applies to online training extraction model types.
It's the size of the batches into which the training set is divided.
-
Inverse of regularization strength
Applies to these categorization model types:
Penalty for misclassification. It's a regularization parameter for training data. It is a number and it is used to prevent over-fitting and under-fitting. If you have a big training set and you consider it representative, the value should be small to increase the regularization and force the model to be tailored on training data. On the contrary, if you have a small dataset, which may not be representative, it is better to select large values to avoid errors.
-
L1 regularization term on weights
Applies to the XGBoost categorization model type.
Regularization alpha parameter. Lasso regularization parameter, used to reduce complexity and prevent over-fitting. When 0, no regularization is applied. The cost function is altered by adding a penalty term equal to the magnitude of the coefficients.
-
L2 regularization term on weights
Applies to the XGBoost categorization model type.
Regularization lambda parameter. Ridge regularization parameter, used to reduce complexity and prevent over-fitting. The cost function is altered by adding a penalty term equal to the squared magnitude of the coefficients. The penalty term lambda regularizes large coefficients by penalizing the cost function. When 0, no regularization is applied.
-
Learning rate
Applies to these categorization model types:
The rate of adaptation of the model to learning, that is how quickly the error tends to decrease.
Raising the value of this parameter shrinks the contributions of each decision tree. This means faster training but possibly less effective models.
-
Left window size, CRF left window size
Applies to these extraction model types:
Number of tokens to the left of an annotation that the algorithm takes into account.
-
Max number of epochs
In online training experiments, it's the maximum number of training epochs, that is, the number of times the ML algorithm passes through the entire training set.
Training can stop before this number of iterations is reached, based on the value of the Patience parameter.
-
N. of iterations with no change
Applies to the GBoost categorization model type.
Parameter used as early stopping criterion. During training, if the score hasn't improved since the last iteration, the training stops. Value -1 means no improvement.
-
Normalize: penalize long documents to avoid their dominance in stats
Applies to the Complement Naive Bayes categorization model type.
When turned on, long documents are discarded to balance statistics.
-
Number of trees, Number of decision trees
Applies to these categorization model types:
Number of decision trees to generate to make a decision. This value affects regularization. A high value implies low regularization (or generalization) and possible over-fitting. A low value, on the other hand, implies a high level of generalization, with the risk of under-fitting and of not predicting some classes.
-
Optimization problem algorithm
Applies to the Logistic Regression categorization model type.
It's the algorithm to use for the optimization problem.
-
Patience
Applies to online training experiments.
It's the maximum number of epochs without improvement that are tolerated before stopping iterations.
-
Right window size, CRF right window size
Applies to these extraction model types:
Number of tokens to the right of an annotation that the algorithm takes into account.
-
SGD alpha regularization parameter
Applies to the SGD categorization model type.
Regularization parameter for training data. It's a number and it is used to prevent over-fitting and under-fitting. Larger values set a stronger regularization.
-
Split criterion on tree nodes
Applies to the Random Forest categorization model type.
The split criterion to be applied on each node of the tree, that is the criterion to choose the branch to follow in the decision tree.
-
Stop condition tolerance, Tolerance for early stopping
Applies to these categorization model types:
It's a number indicating how much error is tolerated before early stopping. Larger values mean fewer iterations.
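The interplay between Max number of epochs and Patience can be sketched as a simple early-stopping loop. This is a minimal illustration, not Platform's training code: the per-epoch scores are made-up inputs, and stopping as soon as the count of non-improving epochs reaches the patience value is an assumption about the exact boundary condition.

```python
def train_with_patience(scores, max_epochs, patience):
    """Return (epochs_run, best_score) for a sequence of per-epoch validation scores."""
    best = float("-inf")
    epochs_without_improvement = 0
    epochs_run = 0
    for score in scores[:max_epochs]:
        epochs_run += 1
        if score > best:
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # patience exhausted: stop before max_epochs
    return epochs_run, best


scores = [0.60, 0.70, 0.69, 0.70, 0.68, 0.90]
print(train_with_patience(scores, max_epochs=10, patience=2))  # (4, 0.7)
```

In this example a larger patience (for instance 5) lets training continue through the plateau and reach the later, better score.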
Generic categorization parameters
Generic parameters apply to explainable categorization experiments.
-
Enable "onCategorizer" optimization
When enabled, categorization fine tuning is performed.
-
Enable "strict" hierarchical mode
When enabled, all ascending categories in the hierarchy are returned together with the aforementioned category.
For example, if the taxonomy models animals and the predicted category is cat then the entire hierarchical path cat > feline > mammals > vertebrates > animals is returned.
-
Enable "single label" mode
If enabled, the model predicts at most one category.
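The effect of "strict" hierarchical mode can be illustrated with the animal taxonomy from the example above. The child-to-parent dictionary and the `with_ancestors` function are hypothetical; they only show how a predicted category expands into its full path.

```python
# Hypothetical taxonomy, stored as child -> parent.
PARENT = {
    "cat": "feline",
    "feline": "mammals",
    "mammals": "vertebrates",
    "vertebrates": "animals",
}


def with_ancestors(category, parent=PARENT):
    """Return the category followed by all of its ancestors up to the root."""
    path = [category]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


print(" > ".join(with_ancestors("cat")))
# cat > feline > mammals > vertebrates > animals
```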
Rules generation
Rules generation parameters apply to these types of experiment:
- Explainable categorization
- Bootstrapped Studio project (categorization)
- Explainable extraction
- Thesaurus generation
Categorization experiments
-
Enable generation of syncon based rules
When enabled, generated rules can use the SYNCON attribute. If disabled, the Enable generation of ancestor based rules parameter is also disabled.
-
Enable generation of ancestor based rules
When enabled, generated rules can use the ANCESTOR attribute. This parameter is disabled if the Enable generation of syncon based rules parameter is disabled.
-
Max number of items in each rule
Maximum number of operands in rules' conditions.
-
Max number of rules for each taxonomy category
Maximum number of rules that can be generated for each category.
-
Min number of annotated documents for a category, to enable rules generation
Minimum number of documents in which a category has been annotated that is required to generate rules for that category.
-
Max number of rules in which any single item can participate
Maximum number of rules in which a text feature (e.g. a concept, a lemma, an exact word) can be used. It's used to limit the excessive generation of rules.
-
Max number of elements in a single item of a rule
Maximum number of attributes that can be used in an operand of a rule condition.
Extraction experiments
-
Maximum number of conditions for any given rule
Maximum number of conditions to use in a rule.
-
Enable automatic minimum support setup
When turned on, Platform will automatically determine the minimum support, that is the minimum number of times a rule must match inside the training set to be included in the model.
This parameter is an alternative to Custom minimum support threshold (see below).
-
Custom minimum support threshold
Manually entered alternative to Enable automatic minimum support setup (see above).
-
Enable automatic minimum confidence setup
When turned on, Platform will automatically determine the minimum confidence, that is the number of times a rule must match in the class target context to be included in the model.
This parameter is an alternative to Custom minimum confidence to explore a rule (see below).
-
Custom minimum confidence to explore a rule
Manually entered alternative to Enable automatic minimum confidence setup (see above).
-
Minimum acceptance confidence threshold
Minimum threshold to determine that a rule is acceptable. Smaller values mean greater acceptance and imply a greater final project recall.
-
Minimum confidence improvement for adding a new condition to a rule
Minimum improvement of rule's confidence that an additional condition must bring in order to be included in the rule.
-
Enable concatenation of contiguous extractions
When turned on, contiguous extractions—composed of multiple adjacent tokens—are concatenated.
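The support and confidence thresholds above can be pictured with a small sketch. It assumes the conventional association-rule reading, where support is the number of times a rule matches and confidence is the fraction of those matches that fall on the target class; Platform's exact definitions may differ, and the function names are hypothetical.

```python
def support_and_confidence(matches):
    """matches: one boolean per rule match, True if the match hit the target class."""
    support = len(matches)
    confidence = sum(matches) / support if support else 0.0
    return support, confidence


def accept(matches, min_support, min_confidence):
    """Keep a rule only if it passes both thresholds."""
    support, confidence = support_and_confidence(matches)
    return support >= min_support and confidence >= min_confidence


rule_matches = [True, True, True, False]  # 4 matches, 3 on the target class
print(support_and_confidence(rule_matches))  # (4, 0.75)
```

Lowering the confidence threshold accepts more rules, which is consistent with the note above that smaller values imply greater recall.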
Thesaurus experiments
-
Template name
Template name for output records.
-
Field name
Name of the field where the concepts are extracted.
-
Use BLEMMA
When turned on, BLEMMA rules can be generated.
-
File/batch granularity
Maximum number of concepts whose extraction rules are placed in a single rule file.
Extraction feature options
Feature options apply to explainable extraction experiments.
-
Window size (in tokens) to the left of the token being predicted
A window is a segment (set of tokens) used to find regularities in the text in order to generate rules. This parameter specifies the number of tokens to consider to the left of (that is, preceding) the predicted token.
-
Window size (in tokens) to the right of the token being predicted
This parameter specifies the number of tokens to consider to the right of (that is, following) the predicted token.
-
Minimum document frequency
Minimum number of documents in which a feature (e.g. a concept, a lemma, an exact word) must be present in order to include it in a rule.
-
Raw word form
When enabled, exact words can be used in rules to match text words.
-
Word base form (Lemma)
When enabled, base forms (lemmas) can be used in rules to match the corresponding attribute of text words.
-
Word Part-of-Speech
When enabled, part-of-speech (for example noun, verb, etc.) can be used in rules to match the corresponding attribute of text words.
-
Syncons
When enabled, Knowledge Graph concepts (syncons) can be used in rules to match the concept expressed by text words.
-
Ancestors
When enabled, Knowledge Graph concepts (syncons) that, in an ISA hierarchy, are ancestors of a given concept can be used in rules to match concepts expressed by text words.
-
Numeric words
When enabled, numeric words (like 500) can be used in rules to match text words.
-
Use suffix of a word
When enabled, word suffixes can be used in rules to match the suffixes of text words.
-
Use prefix of a word
When enabled, word prefixes, also called stems or roots, can be used in rules to match the prefixes of text words.
Categorization fine tuning
Explainable categorization and bootstrapped Studio project experiments create models that can use JavaScript to extend and control the document analysis pipeline.
Fine tuning is performed with the onCategorizer event handling function, which is automatically invoked after categorization rules have been evaluated.
Fine tuning is performed and can be configured only if the Enable "onCategorizer" optimization parameter in the Generic parameters step of the experiment wizard is turned on.
-
Desired clean level
Categorization results cleanup is performed with the CLEAN function.
The value of this parameter affects the value argument of that function: if set to auto in explainable categorization experiments, the fine tuning algorithm iteratively guesses the best value to use, starting from the value of the Default clean level parameter (see below).
-
Default clean level
Doesn't apply to bootstrapped Studio project experiments.
Initial value for the value argument of the CLEAN function when Desired clean level (see above) is set to auto.
Note
When the Enable "single label" mode parameter in the Generic parameters step of the experiment wizard is turned on, this parameter is ignored.
Also, if Enable conservative clean (see below) is enabled, cleanup is skipped.
-
Desired filter sequence
The value to use for the filters argument of the FILTER function. It can be set to auto in explainable categorization experiments, which means that the fine tuning algorithm iteratively guesses the best value starting from the value of the Default filter sequence parameter (see below).
-
Default filter sequence
Doesn't apply to bootstrapped Studio project experiments.
Initial value for the filters argument of the FILTER function when Desired filter sequence (see above) is set to auto.
Note
When the Enable "single label" mode parameter in the Generic parameters step of the experiment wizard is turned on, this parameter is ignored and value 100 is used to keep only the category with the highest score.
-
Enable conservative clean
Doesn't apply to bootstrapped Studio project experiments.
If no category exceeds the clean level (see above), cleanup is not performed.
-
Max number of documents to be considered by the optimization algorithm
Doesn't apply to bootstrapped Studio project experiments.
Maximum number of documents used in the fine tuning process. Value -1 means no limit.
Extraction rules selection
Rules selection parameters apply to explainable extraction experiments. They affect the fine tuning of generated rules.
Rules are fine tuned by validating them against a subset of the training set, ranking them and selecting those with the highest scores.
-
Fine-tuning rules selecting only the most significant ones
When turned on, the rule validation and selection step is performed and all the other fine tuning parameters (see below) can be set.
-
Number of rules selection steps
Number of iterations in the rules validation and selection step.
-
Fraction of validation split
Percentage of the training set that is used by the validation and selection step.
-
Activate rules pruning
When enabled, the number of selected rules to keep and include in the model can be set with the Max number of rules to select parameter (see below).
-
Max number of rules to select
Maximum number of rules to keep after validation and selection. Rules are counted by scrolling through the list of selected rules in descending score order. This parameter can be set only if Activate rules pruning is enabled (see above).
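The ranking and pruning step described above amounts to sorting rules by score and keeping the top ones. The sketch below assumes each rule already has a validation score; the rule names, the scores and the `select_rules` function are made up for illustration.

```python
def select_rules(scored_rules, max_rules=None):
    """Rank (rule, score) pairs by descending score and keep the top max_rules."""
    ranked = sorted(scored_rules, key=lambda pair: pair[1], reverse=True)
    return ranked if max_rules is None else ranked[:max_rules]


scored = [("rule_a", 0.62), ("rule_b", 0.91), ("rule_c", 0.78)]
top = select_rules(scored, max_rules=2)
print(top)  # [('rule_b', 0.91), ('rule_c', 0.78)]
```

Passing `max_rules=None` mirrors the case where Activate rules pruning is off and all selected rules are kept.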
F-Beta
F-Beta parameters apply to all experiments.
F-Beta is a more general way of computing the F-score. F-Beta parameters affect the balance between precision and recall when computing the F-Measure at the end of the test phase of the experiment.
F-Beta parameters are:
- Enable F-Beta optimization (tuning balance between precision and recall): when turned on, it is possible to set the Target F-Beta parameter.
- Target F-Beta: value 1 gives the same weight to precision and recall, values lower than 1 give more weight to precision while values greater than 1 give more weight to recall.
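For reference, the standard F-Beta formula is (1 + beta^2) * P * R / (beta^2 * P + R), where P is precision and R is recall. The sketch below applies it to sample values to show how the Target F-Beta setting shifts the balance; the function name and the sample precision and recall are illustrative.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


precision, recall = 0.9, 0.6
for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_beta(precision, recall, beta), 3))
# 0.5 0.818
# 1.0 0.72
# 2.0 0.643
```

Because precision is higher than recall in this example, a beta below 1 rewards the model and a beta above 1 penalizes it.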
Auto ML parameters
These parameters apply to Auto-ML experiments when you turn on Automatic features selection or Activate Auto-ML on every parameter.
In that case, Platform trains an ML model that it then uses to predict the best features and the best hyperparameter values to use when actually training the experiment model.
This assistant model is trained iteratively by passing through its training data multiple times. Its parameters are:
- Number of training iterations for the AutoML algorithm: maximum number of self-tuning iterations.
- Number of data splits for cross-validation of AutoML algorithm: number of subdivisions of training data.
- Call back function for stopping the AutoML self-tuning process: early iteration termination policy. The stop can occur when a high score is reached, when a time limit is exceeded or when a combination of a good score and elapsed time occurs.
- Target time deadline for the AutoML call back stop function (minutes): time limit beyond which the self-tuning algorithm stops iterating if a time-based early termination policy has been chosen (see the parameter above).
Layout information
The Analysis strategy for documents with layout information parameter applies to Studio experiments.
It determines how to manage graphical layout information that can be present in test documents. Test documents can have layout information if they originate from PDF files that were imported with the PDF document view option enabled.
Possible values are:
- Require layout information: only documents with layout information are analyzed.
- Rely on expert.ai Extract layout information where available: all documents are analyzed. The model will leverage layout information—when present—if it has been programmed to do so.
- Plain text: all documents are analyzed and layout information is ignored, only the plain text is fed to the model.
Matching strategy
The matching strategy parameter applies to extraction experiments.
The value of the parameter (strict, ignore value or ignore position) determines the strategy used to compute experiment metrics.
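One plausible reading of the three strategies is sketched below: strict requires both value and position to agree, ignore value compares positions only, and ignore position compares values only. This interpretation, the `Extraction` type and the `is_match` function are all assumptions made for illustration; the exact semantics are not spelled out here.

```python
from typing import NamedTuple


class Extraction(NamedTuple):
    # Hypothetical representation of an extracted or annotated value.
    label: str
    value: str
    start: int
    end: int


def is_match(predicted, expected, strategy="strict"):
    """Decide whether a predicted extraction counts as a hit for an expected one."""
    if predicted.label != expected.label:
        return False
    same_value = predicted.value == expected.value
    same_position = (predicted.start, predicted.end) == (expected.start, expected.end)
    if strategy == "strict":
        return same_value and same_position
    if strategy == "ignore value":
        return same_position
    if strategy == "ignore position":
        return same_value
    raise ValueError(f"unknown strategy: {strategy}")


expected = Extraction("company", "Acme Corp.", 10, 20)
predicted = Extraction("company", "Acme Corp.", 42, 52)  # same value, other position
print(is_match(predicted, expected, "strict"))           # False
print(is_match(predicted, expected, "ignore position"))  # True
```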
Labels
These parameters apply to thesaurus generation experiments.
-
Consider labels coming from linked public sources
When turned on, labels deriving from linked public sources are considered when generating concept extraction rules.
-
Consider labels coming from linked projects
When turned on, labels deriving from linked internal sources are considered when generating concept extraction rules.