Experiment parameters

The parameters that can be set in the categorization, extraction and thesaurus experiments are described below.

Training documents selection

The selection of the documents to use to train the model is the first step of all the experiments that generate models. Its parameters are:

  • Training documents selection policy

    This parameter applies to all the categorization and extraction experiments that generate a model.
    It determines which documents from the training set are used for training. Possible values are:

    • Only validated annotated documents (strict): only annotated documents that have also been validated will be used to train the model.
    • Only validated or annotated documents (strict): only documents that are annotated or validated will be used to train the model.
    • Prefer validated documents: in case of sub-sampling, validated documents will be preferred over non-validated documents.
    • Prefer annotated documents: in case of sub-sampling, annotated documents will be preferred over non-annotated documents.
    • Random selection: the documents used to train the model will be randomly selected from the library.
  • Enable subsampling using random selection strategy

    This parameter applies to ML categorization experiments and to Auto-ML extraction experiments.

    When turned on, only a randomly selected subset of the training library is used to train the model. The Subsampling max documents parameter (see below) determines the size of the subset.

  • Subsampling max documents

    This is a sub-parameter of Enable subsampling using random selection strategy (see above). It determines the size of the subset.
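
As a rough illustration of how the two subsampling parameters above interact, the sketch below draws a random subset capped at a maximum size. The `subsample` helper and its arguments are hypothetical names used only for this example; the selection logic Platform actually applies may differ.

```python
import random


def subsample(documents, max_documents, seed=None):
    """Return at most max_documents randomly chosen documents (hypothetical helper)."""
    documents = list(documents)
    if max_documents >= len(documents):
        # The library is already within the limit: keep every document.
        return documents
    rng = random.Random(seed)
    # sample() picks without replacement, so no document is selected twice.
    return rng.sample(documents, max_documents)


library = [f"doc_{i}" for i in range(10)]
subset = subsample(library, max_documents=4, seed=0)
print(len(subset))  # 4
```

When the limit is greater than or equal to the library size, the sketch simply returns the whole library.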

Training windows

The following parameters apply to ML extraction experiments and affect the areas of text around annotations (windows) that are considered in the training process.

  • Ignore non-annotated areas

    When turned on, for non-validated documents, the only portions of text used to train the models are those around the annotations. The size of the area around annotations is determined by the Annotated area window size parameter (see below). For validated documents, instead, all the text is considered.

  • Annotated area window size

    This is a sub-parameter of Ignore non-annotated areas (see above).

    It's the size of the area around annotations to consider for training. It's expressed in sentences before and after the sentence containing the annotation, so for example value 2 means the area includes two sentences before and two sentences after.

  • Enable Negative Sub-sampling

    It's an alternative to Ignore non-annotated areas (see above).

    When turned on, the training algorithm smartly chooses some non-annotated areas around annotations and excludes them from training in order to reduce noise.
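
The meaning of Annotated area window size described above can be sketched as follows. The `annotated_area` function is a hypothetical helper written for this example, and clamping the window at the beginning and end of the document is an assumption.

```python
def annotated_area(sentence_count, annotated_index, window_size):
    """Return the indices of the sentences kept around one annotated sentence.

    Hypothetical helper: the window is clamped to the document boundaries.
    """
    start = max(0, annotated_index - window_size)
    end = min(sentence_count - 1, annotated_index + window_size)
    return list(range(start, end + 1))


# A 10-sentence document with an annotation in sentence 5 and window size 2:
# two sentences before and two sentences after are kept.
print(annotated_area(10, 5, 2))  # [3, 4, 5, 6, 7]
```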

Problem definition

Problem definition is a step of ML categorization experiments. Its parameters are:

  • Enable strict "single label" mode

    When turned on, the model predicts at most one category for each document. When off, the model can detect any number of categories.

  • (only for Auto-ML Categorization) Enable strict "Sub document categorization" compatibility mode

    When turned on, the generated model can predict categories for sub-documents, which are portions of the input document.

    The Platform authoring application does not handle the concept of sub-document, so during an experiment the model is trained with entire documents, annotated with the expected categories, and predicts categories for entire test library documents. For this reason, when you have a sub-document categorization use case, the original documents must be broken into sub-documents using an external tool, and the sub-documents must then be imported as "normal" documents in the training and test libraries. This way you train and test the model on documents that are, in fact, chunks of larger source documents.

    Once published and inserted in a workflow, however, the model can effectively manage sub-documents provided that:

    1. The input to the model block contains the text of the whole document plus the information (boundaries) that identifies sub-documents inside it.
    2. The type of boundaries is specified as a configuration parameter of the model block.

    Under these conditions the model makes predictions for sub-documents, and each output category is accompanied by the boundaries of the portion of the input document that identifies the sub-document the category refers to.

Feature space

Feature space parameters affect ML categorization and extraction experiments and determine which features of the text are used to train the model.

In Auto ML experiments, Platform can automatically decide the features to use. This behavior is activated when the Automatic features selection option is turned on.

The features that can be turned on or off are:

Feature | Description | Experiment type
Alpha Numeric words | Words consisting of both letters and digits | Extraction
Alphabetic words | Words consisting only of letters | Extraction
Collocations | Combined words, the combination has its own meaning (e.g. credit card or take a risk) | Extraction
Decimal number words | Words representing decimal numbers | Extraction
Digit words | Words consisting only of digits | Extraction
Entities | Named entities like people, places and organizations | Categorization and extraction
Knowledge Graph relations | Knowledge Graph ascending concepts, along ISA-type relationships, of the concept corresponding to text words (e.g. dentist is a medical specialist, which is a doctor, which is a professional) | Categorization and extraction
Knowledge Label | Main lemma of the concept which, inside the Knowledge Graph, in an ISA relation, is the parent of the concept corresponding to the text word (e.g. if the text word is moratorium, its parent concept's label is legal action) | Categorization
Known Concepts | Knowledge Graph concepts (syncons) for text words that are well known proper nouns (e.g. World Cup, United States) | Extraction
Logic dependencies | Syntactic relationships between text words (e.g. subject-verb-object) | Extraction
Main lemma | Base forms (lemmas) of the most important words of the text | Categorization
Main Syncons | Knowledge Graph concepts (syncons) of the most important words of the text | Categorization
Main Topics | Knowledge Graph topics the text is primarily about | Categorization
Mixed case words | Words consisting of both uppercase and lowercase letters | Extraction
Numeric words | Words that represent numbers | Extraction
Phrases | Phrases, i.e. one or more words that form a meaningful grammatical unit | Extraction
Sub-words | Parts of a word like morphemes, stems and roots | Categorization
Syncon Topics | The topics that in the Knowledge Graph are attributed to the concepts (syncons) corresponding to the text words | Categorization
Syncons | Knowledge Graph concepts (syncons) corresponding to text words | Categorization and extraction
Title case words | Capitalized words | Extraction
Upper case word | Words consisting only of uppercase letters | Extraction
Use word embeddings | Static word embeddings | Categorization
Word base form (Lemma) | Base form (lemma) of text words (e.g. run for running and ran) | Categorization and extraction
Word base form stem | Stem of text words (e.g. intern for international) | Categorization
Word form | Text word exactly as written in the text | Categorization
Word Part-of-Speech | Part-of-speech of a word (e.g. noun, verb, adjective) | Extraction

Other feature space parameters are:

  • For both categorization and extraction experiments:

    • Max features: the maximum number of text features to use to train the model. Value 0 means all available features.
  • For categorization experiments:

    • Maximum N for N-grams: the N to compute N-grams for stem and keywords features. All the N-grams up to N will be computed: value 1 means that only unigrams will be computed, while value 3 means that unigrams, bigrams and trigrams will be used.
    • Min DF: the minimum number of documents in which a feature must appear to be considered for training.
    • Max DF: the maximum percentage of documents in which a feature can appear to be considered for training.
  • For extraction experiments:

    • Min WF: the minimum number of windows (areas around annotations) per document in which a feature must appear in order to be considered for training.
    • Max WF: the maximum percentage of windows per document in which a feature can appear to be considered for training.
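
To make the Maximum N for N-grams, Min DF and Max DF parameters above more concrete, here is a minimal sketch. The function names are hypothetical, features are represented as plain tuples of tokens, and Max DF is expressed as a fraction between 0 and 1 rather than a percentage; none of this is taken from Platform's implementation.

```python
from collections import Counter


def ngrams_up_to(tokens, max_n):
    """Return all n-grams of length 1..max_n as tuples (hypothetical helper)."""
    return [
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def filter_by_df(docs_features, min_df, max_df):
    """Keep features found in at least min_df documents and in at most
    max_df (a fraction between 0 and 1) of all documents."""
    # Count each feature once per document, whatever its frequency inside it.
    df = Counter(f for features in docs_features for f in set(features))
    total = len(docs_features)
    return {f for f, count in df.items()
            if count >= min_df and count / total <= max_df}


# With N = 2, unigrams and bigrams are produced.
print(ngrams_up_to(["credit", "card", "fee"], 2))
# [('credit',), ('card',), ('fee',), ('credit', 'card'), ('card', 'fee')]

docs = [["a", "b"], ["a", "c"], ["a", "b"], ["a"]]
# "a" is too frequent (4 of 4 documents), "c" is too rare (1 document).
print(filter_by_df(docs, min_df=2, max_df=0.75))  # {'b'}
```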

Hyperparameters

Hyperparameters apply to ML categorization and extraction experiments.

  • Alpha regularizer

    Applies to these categorization model types:

    Regularization parameter, smoothing factor on term counts. Large values increase the regularization.

  • C parameter: penalty for misclassifications

    Applies to these categorization model types:

    The C parameter is a regularization (or generalization) parameter for training data.
    It is a number and it is used to prevent over-fitting and under-fitting.
    If you have a big training set and you consider it representative, the parameter value should be small to increase the regularization and force the model to be tailored on training data. On the contrary, if you have a small training set, which may not be representative, it is better to select a large value to avoid the possible errors related to a strong regularization.

  • Class weight

    Applies to these categorization model types:

    Regularization parameter to balance categories. Possible values are:

    • Balanced
    • None

    If one category is preponderant over all the others in the training set, balancing categories prevents unbalanced predictions for less represented classes. If the training set is highly representative, balancing makes the model perform slightly worse. If, on the other hand, the training set is not very representative and balancing is not enabled, model performance is poor.

  • CRF c1 regularization coefficient

    Applies to the CRF extraction model type.

    Regularization coefficient. If set to a value greater than 0, it enables L1 (alpha) regularization, which is performed with the Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) method.

  • CRF c2 regularization coefficient

    Applies to the CRF extraction model type.

    Regularization coefficient. If a value is set, it is used as the L2 (lambda) regularization value of the l2sgd and lbfgs training algorithms.

  • CRF Forced use of all possible states

    Applies to the CRF extraction model type.
    When enabled, the algorithm generates state features for all the combinations of attributes and labels, including those that don't occur in the training data (negative state features). This may improve labeling accuracy but slows down the training process.

  • CRF Forced use of all possible transitions

    Applies to the CRF extraction model type.
    When enabled, the algorithm generates transition features for all the possible pairs of labels, even if they don't occur in training data (negative transition features).

  • Custom kernel type to be applied

    Applies to the Custom kernel SVM categorization model type.
    It's the kernel function to use to represent features.

  • Degree of polynomial for polynomial kernel

    Applies to the Custom Kernel SVM categorization model type.
    Affects the polynomial custom kernel. It's the degree of the polynomial function.

  • Fit batch size

    Applies to online training extraction model types.
    It's the size of the batches in which the training set gets divided.

  • Inverse of regularization strength

    Applies to these categorization model types:

    Penalty for misclassification. It's a regularization parameter for training data. It is a number and it is used to prevent over-fitting and under-fitting. If you have a big training set and you consider it representative, the value should be small to increase the regularization and force the model to be tailored on training data. On the contrary, if you have a small dataset, which may not be representative, it is better to select large values to avoid errors.

  • L1 regularization term on weights

    Applies to the XGBoost categorization model type.

    Regularization alpha parameter. Lasso regularization parameter, used to reduce complexity and prevent over-fitting. When 0, no regularization is applied. The cost function is altered by adding a penalty term equal to the magnitude of the coefficients.

  • L2 regularization term on weights

    Applies to the XGBoost categorization model type.

    Regularization lambda parameter. Ridge regularization parameter, used to reduce complexity and prevent over-fitting. The cost function is altered by adding a penalty term equal to the squared magnitude of the coefficients. The penalty term lambda regularizes large coefficients by penalizing the cost function. When 0, no regularization is applied.

  • Learning rate

    Applies to these categorization model types:

    The rate of adaptation of the model to learning, that is how quickly the error tends to decrease.
    Raising the value of this parameter increases the contribution of each decision tree. This means faster training but possibly less effective models.

  • Left window size, CRF left window size

    Applies to these extraction model types:

    Number of tokens to the left of an annotation that the algorithm takes into account.

  • Max number of epochs

    In online training experiments, it's the maximum number of training epochs, that is the number of times the ML algorithm passes through the entire training set.
    Training can stop before this number of iterations based on the value of the Patience parameter.

  • N. of iterations with no change

    Applies to the GBoost categorization model type.

    Parameter used as early stopping criterion. During training, if the score hasn't improved for this number of consecutive iterations, the training stops. Value -1 means no improvement.

  • Normalize: penalize long documents to avoid their dominance in stats

    Applies to the Complement Naive Bayes categorization model type.
    When turned on, long documents are penalized so that they don't dominate statistics.

  • Number of trees, Number of decision trees

    Applies to these categorization model types:

    Number of decision trees to generate to make a decision. This value affects regularization. A high value implies a low regularization (or generalization) and a possible over-fitting. On the other hand, a low value implies a high level of generalization, with the risk of not predicting some classes and of under-fitting.

  • Optimization problem algorithm

    Applies to the Logistic Regression categorization model type.
    It's the algorithm to use for the optimization problem.

  • Patience

    Applies to online training experiments.
    It's the maximum number of epochs without improvement that are tolerated before stopping iterations.

  • Right window size, CRF right window size

    Applies to these extraction model types:

    Number of tokens to the right of an annotation that the algorithm takes into account.

  • SGD alpha regularization parameter

    Applies to the SGD categorization model type.
    Regularization parameter for training data. It's a number and it is used to prevent over-fitting and under-fitting. Larger values set a stronger regularization.

  • Split criterion on tree nodes

    Applies to the Random Forest categorization model type.

    The split criterion to be applied on each node of the tree, that is the criterion to choose the branch to follow in the decision tree.

  • Stop condition tolerance, Tolerance for early stopping

    Applies to these categorization model types:

    It's a number indicating how much error is tolerated before early stopping. Larger values mean fewer iterations.
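
The interplay between Max number of epochs and Patience described above can be sketched with a small simulation. The `completed_epochs` function is hypothetical, the per-epoch scores are made-up illustrative values, and stopping as soon as the count of epochs without improvement reaches the patience value is an assumption about the exact stopping rule.

```python
def completed_epochs(epoch_scores, max_epochs, patience):
    """Simulate early stopping over a list of per-epoch scores.

    Hypothetical sketch: returns how many epochs run before training stops.
    """
    best_score = float("-inf")
    stale_epochs = 0  # consecutive epochs without improvement
    completed = 0
    for score in epoch_scores[:max_epochs]:
        completed += 1
        if score > best_score:
            best_score = score
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break  # patience exhausted: stop before max_epochs
    return completed


scores = [0.50, 0.60, 0.60, 0.59, 0.70]  # illustrative values only
print(completed_epochs(scores, max_epochs=10, patience=2))  # 4
print(completed_epochs(scores, max_epochs=3, patience=2))   # 3
```

In the first call training stops after the fourth epoch, because two epochs in a row failed to beat the best score; in the second call the epoch limit is reached first.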

Generic categorization parameters

Generic parameters apply to explainable categorization experiments.

  • Enable "onCategorizer" optimization

    When enabled, categorization fine tuning is performed.

  • Enable "strict" hierarchical mode

    When enabled, all ascending categories in the hierarchy are returned together with the predicted category.
    For example, if the taxonomy models animals and the predicted category is cat then the entire hierarchical path cat > feline > mammals > vertebrates > animals is returned.

  • Enable "single label" mode

    If enabled, the model predicts at most one category.
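
The effect of the "strict" hierarchical mode on the animal taxonomy example above can be sketched as follows. The child-to-parent dictionary and the `ancestor_path` helper are hypothetical and only serve to illustrate the idea; they do not reflect how Platform stores taxonomies.

```python
def ancestor_path(category, parent_of):
    """Return the category followed by all its ancestors, up to the root.

    Hypothetical helper: assumes the taxonomy is a tree (no cycles).
    """
    path = [category]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return path


# Hypothetical taxonomy, expressed as a child -> parent mapping.
parent_of = {
    "cat": "feline",
    "feline": "mammals",
    "mammals": "vertebrates",
    "vertebrates": "animals",
}
print(" > ".join(ancestor_path("cat", parent_of)))
# cat > feline > mammals > vertebrates > animals
```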

Rules generation

Rules generation parameters apply to these types of experiment:

  • Explainable categorization
  • Bootstrapped Studio project (categorization)
  • Explainable extraction
  • Thesaurus generation

Categorization experiments

  • Enable generation of syncon based rules

    When enabled, generated rules can use the SYNCON attribute. If disabled, the Enable generation of ancestor based rules parameter is also disabled.

  • Enable generation of ancestor based rules

    When enabled, generated rules can use the ANCESTOR attribute. This parameter is disabled if the Enable generation of syncon based rules parameter is disabled.

  • Max number of items in each rule

    Maximum number of operands in rules' conditions.

  • Max number of rules for each taxonomy category

    Maximum number of rules that can be generated for each category.

  • Min number of annotated documents for a category, to enable rules generation

    Minimum number of documents in which a category has been annotated that is required to generate rules for that category.

  • Max number of rules in which any single item can participate

    Maximum number of rules in which a text feature (e.g. a concept, a lemma, an exact word) can be used. It's used to limit the excessive generation of rules.

  • Max number of elements in a single item of a rule

    Maximum number of attributes that can be used in an operand of a rule condition.

Extraction experiments

  • Maximum number of conditions for any given rule

    Maximum number of conditions to use in a rule.

  • Enable automatic minimum support setup

    When turned on, Platform will automatically determine the minimum support, that is the minimum number of times a rule must match inside the training set to be included in the model.
    This parameter is an alternative to Custom minimum support threshold (see below).

  • Custom minimum support threshold

    Manually entered alternative to Enable automatic minimum support setup (see above).

  • Enable automatic minimum confidence setup

    When turned on, Platform will automatically determine the minimum confidence, that is the number of times a rule must match in the class target context to be included in the model.
    This parameter is an alternative to Custom minimum confidence to explore a rule (see below).

  • Custom minimum confidence to explore a rule

    Manually entered alternative to Enable automatic minimum confidence setup (see above).

  • Minimum acceptance confidence threshold

    Minimum threshold to determine that a rule is acceptable. Smaller values mean greater acceptance and imply a greater final project recall.

  • Minimum confidence improvement for adding a new condition to a rule

    Minimum improvement of rule's confidence that an additional condition must bring in order to be included in the rule.

  • Enable concatenation of contiguous extractions

    When turned on, contiguous extractions—composed of multiple adjacent tokens—are concatenated.
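
As a hedged illustration of the support and confidence thresholds above, the sketch below treats support as the number of times a rule matches in the training set and confidence as the fraction of those matches that fall on the target class. The second definition is the usual one for rule mining and is an assumption here; the function names and the match data are hypothetical.

```python
def support_and_confidence(matches):
    """Compute support and confidence for one rule.

    matches: one boolean per rule match, True when the match falls on the
    target class (hypothetical representation).
    """
    support = len(matches)
    confidence = sum(matches) / support if support else 0.0
    return support, confidence


def keep_rule(matches, min_support, min_confidence):
    """Return True when the rule reaches both thresholds."""
    support, confidence = support_and_confidence(matches)
    return support >= min_support and confidence >= min_confidence


rule_matches = [True, True, False, True]  # illustrative values only
print(support_and_confidence(rule_matches))  # (4, 0.75)
print(keep_rule(rule_matches, min_support=3, min_confidence=0.8))  # False
```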

Thesaurus experiments

  • Template name

    Template name for output records.

  • Field name

    Name of the field where the concepts are extracted.

  • Use BLEMMA

    When turned on, BLEMMA rules can be generated.

  • File/batch granularity

    Maximum number of concepts whose extraction rules are placed in a single rule file.

Extraction feature options

Feature options apply to explainable extraction experiments.

  • Window size (in tokens) to the left of the token being predicted

    A window is a segment (set of tokens) used to find regularities in the text in order to generate rules. This parameter specifies the number of tokens to consider to the left of the predicted token, that is the tokens that precede it.

  • Window size (in tokens) to the right of the token being predicted

    This parameter specifies the number of tokens to consider to the right of the predicted token, that is the tokens that follow it.

  • Minimum document frequency

    Minimum number of documents in which a feature (e.g. a concept, a lemma, an exact word) must be present in order to include it in a rule.

  • Raw word form

    When enabled, exact words can be used in rules to match text words.

  • Word base form (Lemma)

    When enabled, base forms (lemmas) can be used in rules to match the corresponding attribute of text words.

  • Word Part-of-Speech

    When enabled, part-of-speech (for example noun, verb, etc.) can be used in rules to match the corresponding attribute of text words.

  • Syncons

    When enabled, Knowledge Graph concepts (syncons) can be used in rules to match the concept expressed by text words.

  • Ancestors

    When enabled, Knowledge Graph concepts (syncons) that, in an ISA hierarchy, are the ascendant of a given concept can be used in rules to match concepts expressed by text words.

  • Numeric words

    When enabled, numeric words (like 500) can be used in rules to match text words.

  • Use suffix of a word

    When enabled, word suffixes can be used in rules to match the suffixes of text words.

  • Use prefix of a word

    When enabled, word prefixes (also called stems or roots) can be used in rules to match the prefixes of text words.
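
The two window size parameters at the top of this list can be pictured with a short sketch. The `token_window` helper and the sample sentence are hypothetical; whitespace tokenization and truncating the window at the edges of the text are assumptions.

```python
def token_window(tokens, index, left, right):
    """Return the tokens from `left` positions before to `right` positions
    after the token at `index` (hypothetical helper)."""
    start = max(0, index - left)  # never go past the beginning of the text
    return tokens[start:index + right + 1]


tokens = ["The", "invoice", "total", "is", "500", "euros", "today"]
# Token being predicted: "500" (index 4), with 2 tokens to the left and 1 to the right.
print(token_window(tokens, 4, left=2, right=1))
# ['total', 'is', '500', 'euros']
```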

Categorization fine tuning

Explainable categorization and bootstrapped Studio project experiments create models that can use JavaScript to extend and control the document analysis pipeline.
Fine tuning is performed with the onCategorizer event handling function, which is automatically invoked after categorization rules have been evaluated.

Fine tuning is performed and can be configured only if the Enable "onCategorizer" optimization parameter in the Generic parameters step of the experiment wizard is turned on.

  • Desired clean level

    Categorization results cleanup is performed with the CLEAN function.
    The value of this parameter affects the value argument of that function: if set to auto in explainable categorization experiments, the fine tuning algorithm iteratively guesses the best value to use starting from the value of the Default clean level parameter (see below).

  • Default clean level

    Doesn't apply to bootstrapped Studio project experiments.
    Initial value for the value argument of the CLEAN function when Desired clean level (see above) is set to auto.

    Note

    When the Enable "single label" mode parameter in the Generic parameters step of the experiment wizard is turned on, this parameter is ignored.
    Also, if Enable conservative clean (see below) is enabled, cleanup is skipped.

  • Desired filter sequence

    The value to use for the filters argument of the FILTER function. It can be set to auto in explainable categorization experiments which means that the fine tuning algorithm iteratively guesses the best value starting from the value of the Default filter sequence parameter (see below).

  • Default filter sequence

    Doesn't apply to bootstrapped Studio project experiments.
    Initial value for the filters argument of the FILTER function when Desired filter sequence (see above) is set to auto.

    Note

    When the Enable "single label" mode parameter in the Generic parameters step of the experiment wizard is turned on, this parameter is ignored and value 100 is used to keep only the category with the highest score.

  • Enable conservative clean

    Doesn't apply to bootstrapped Studio project experiments.
    If no category exceeds the clean level (see above), cleanup is not performed.

  • Max number of documents to be considered by the optimization algorithm

    Doesn't apply to bootstrapped Studio project experiments.
    Maximum number of documents used in the fine tuning process. Value -1 means no limit.

Extraction rules selection

Rules selection parameters apply to explainable extraction experiments. They affect the fine tuning of generated rules.
Rules are fine tuned by validating them against a subset of the training set, ranking them and selecting those with the highest scores.

  • Fine-tuning rules selecting only the most significant ones

    When turned on, the rule validation and selection step is performed and all the other fine tuning parameters (see below) can be set.

  • Number of rules selection steps

    Number of iterations in the rules validation and selection step.

  • Fraction of validation split

    Percentage of the training set that is used by the validation and selection step.

  • Activate rules pruning

    When enabled, the number of selected rules to keep and include in the model can be set with the Max number of rules to select parameter (see below).

  • Max number of rules to select

    Maximum number of rules to keep after validation and selection. Rules are counted by scrolling through the list of selected rules in descending score order. This parameter can be set only if Activate rules pruning is enabled (see above).
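
The pruning step described above, keeping the best rules in descending score order, can be sketched like this. The `select_rules` function, the rule names and the scores are hypothetical and only illustrate the ranking idea.

```python
def select_rules(scored_rules, max_rules):
    """Keep the max_rules highest-scoring rules (hypothetical helper).

    scored_rules: list of (rule_name, score) pairs.
    """
    ranked = sorted(scored_rules, key=lambda item: item[1], reverse=True)
    return ranked[:max_rules]


rules = [("rule_a", 0.4), ("rule_b", 0.9), ("rule_c", 0.7)]  # illustrative values
print(select_rules(rules, max_rules=2))
# [('rule_b', 0.9), ('rule_c', 0.7)]
```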

F-Beta

F-Beta parameters apply to all experiments.

F-Beta is a more general way of computing the F-score. F-Beta parameters affect the balance between precision and recall when computing F-Measure at the end of the test phase of the experiment.

F-Beta parameters are:

  • Enable F-Beta optimization (tuning balance between precision and recall): when turned on, it is possible to set the Target F-Beta parameter.
  • Target F-Beta: value 1 gives the same weight to precision and recall, values lower than 1 give more weight to precision while values greater than 1 give more weight to recall.
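
The effect of the Target F-Beta value can be checked with the standard F-Beta formula, (1 + beta²) · precision · recall / (beta² · precision + recall). The sketch below uses that textbook definition; the `f_beta` function name and the sample precision and recall values are illustrative, and it is an assumption that Platform computes the measure in exactly this way.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-Beta score; returns 0.0 when the denominator is zero."""
    beta_squared = beta ** 2
    denominator = beta_squared * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta_squared) * precision * recall / denominator


precision, recall = 0.8, 0.5  # illustrative values only
print(round(f_beta(precision, recall, beta=1.0), 3))  # 0.615
print(round(f_beta(precision, recall, beta=0.5), 3))  # 0.714
print(round(f_beta(precision, recall, beta=2.0), 3))  # 0.541
```

With beta lower than 1 the result moves toward the precision value (0.8), while with beta greater than 1 it moves toward the recall value (0.5).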

Auto ML parameters

These parameters apply to Auto-ML experiments when you turn on Automatic features selection or Activate Auto-ML on every parameter.
In that case, Platform trains an ML model that it then uses to predict the best features and the best hyperparameter values to use when actually training the experiment model.

This assistant model is trained iteratively by passing through its training data multiple times. Its parameters are:

  • Number of training iterations for the AutoML algorithm: maximum number of self-tuning iterations.
  • Number of data splits for cross-validation of AutoML algorithm: number of subdivisions of training data.
  • Call back function for stopping the AutoML self-tuning process: early iteration termination policy. The stop can occur when a high score is reached, when a time limit is exceeded or when a combination of a good score and elapsed time occurs.
  • Target time deadline for the AutoML call back stop function (minutes): time limit beyond which the self-tuning algorithm stops iterating if a time-based early termination policy has been chosen (see the parameter above).

Layout information

The Analysis strategy for documents with layout information parameter applies to Studio experiments.

It determines how to manage graphical layout information that can be present in test documents. Test documents can have layout information if they originate from PDF files that were imported with the PDF document view option enabled.

Possible values are:

  • Require layout information: only documents with layout information are analyzed.
  • Rely on expert.ai Extract layout information where available: all documents are analyzed. The model will leverage layout information—when present—if it has been programmed to do so.
  • Plain text: all documents are analyzed and layout information is ignored, only the plain text is fed to the model.

Matching strategy

The matching strategy parameter applies to extraction experiments.

The value of the parameter (strict, ignore value or ignore position) determines the strategy used to compute experiment metrics.

Labels

These parameters apply to thesaurus generation experiments.

  • Consider labels coming from linked public sources

    When turned on, labels deriving from linked public sources are considered when generating concept extraction rules.

  • Consider labels coming from linked projects

    When turned on, labels deriving from linked internal sources are considered when generating concept extraction rules.