Experiment parameters

The parameters that can be set in the categorization, extraction and thesaurus experiments are described below.

Training docs

The selection of the documents to use to train the model in the Training docs tab is the first step of categorization and extraction experiments that generate models. Its parameters are:

  • Training documents selection policy

    This parameter applies to all the categorization and extraction experiments that generate a model.
    It determines which documents from the training set are used for training. Possible values are:

    • Only validated documents (strict): only annotated documents that have also been validated will be used to train the model.
    • Only validated or annotated documents (strict): only documents that are annotated or validated will be used to train the model.
    • Prefer validated documents: in case of sub-sampling, validated documents will be preferred over non-validated documents.
    • Prefer annotated documents: in case of sub-sampling, annotated documents will be preferred over non-annotated documents.
    • Random selection: the documents used to train the model will be randomly selected from the library.
  • Enable subsampling using random selection strategy

    This parameter applies to ML categorization experiments and to Auto-ML extraction experiments.

    When turned on, only a randomly selected subset of the training library is used to train the model. The Subsampling max documents parameter (see below) determines the size of the subset.

  • Subsampling max documents

    This is a sub-parameter of Enable subsampling using random selection strategy (see above). It determines the size of the subset.
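
As an illustration only, the selection policies and subsampling described above can be pictured as a filter followed by an optional capped random sample. The sketch below is not Platform code: the `Doc` type, the policy identifiers and the `select_training_docs` function are hypothetical names introduced for this example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    name: str
    annotated: bool
    validated: bool


def select_training_docs(docs, policy, max_docs=None, seed=0):
    """Hypothetical sketch of the training documents selection policies."""
    rng = random.Random(seed)
    if policy == "only_validated":
        # Strict: annotated documents that have also been validated.
        pool = [d for d in docs if d.annotated and d.validated]
    elif policy == "only_validated_or_annotated":
        # Strict: documents that are annotated or validated.
        pool = [d for d in docs if d.annotated or d.validated]
    else:
        pool = list(docs)
    if max_docs is None or len(pool) <= max_docs:
        return pool
    # Sub-sampling: shuffle first, then move preferred documents to the front.
    rng.shuffle(pool)
    if policy == "prefer_validated":
        pool.sort(key=lambda d: not d.validated)  # stable sort, validated first
    elif policy == "prefer_annotated":
        pool.sort(key=lambda d: not d.annotated)  # stable sort, annotated first
    return pool[:max_docs]
```

Here `max_docs` plays the role of Subsampling max documents, and the "prefer" policies only change the outcome when the pool is larger than that limit.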

The following parameters apply to ML extraction experiments and affect the areas of text around annotations (windows) that are considered in the training process.

  • Ignore non-annotated areas

    When turned on, for non-validated documents, the only portions of text used to train the models are those around the annotations. The size of the area around annotations is determined by the Annotated area window size parameter (see below). For validated documents, instead, all the text is considered.

  • Annotated area window size

    This is a sub-parameter of Ignore non-annotated areas (see above).

    It's the size of the area around annotations to consider for training. It's expressed in sentences before and after the sentence containing the annotation, so for example value 2 means the area includes two sentences before and two sentences after.

  • Enable Negative Sub-sampling

    It's an alternative to Ignore non-annotated areas (see above).

    When turned on, the training algorithm smartly chooses some non-annotated areas around annotations and excludes them from training in order to reduce noise.
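
To make the Annotated area window size concrete, the following minimal sketch computes which sentences fall inside the windows around annotations. The function name and the use of zero-based sentence indices are assumptions made for this example, not part of Platform.

```python
def sentences_in_window(annotated_sentences, window_size, total_sentences):
    """Return the sorted indices of the sentences kept for training.

    annotated_sentences: indices of the sentences that contain an annotation.
    window_size: number of sentences kept before and after each of them.
    """
    kept = set()
    for idx in annotated_sentences:
        start = max(0, idx - window_size)
        end = min(total_sentences - 1, idx + window_size)
        kept.update(range(start, end + 1))
    return sorted(kept)
```

With a window size of 2 and a single annotation in sentence 5 of a ten-sentence document, sentences 3 to 7 are kept.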

Model type

Use the Model type tab to select a model type for your experiment. This tab is available for these experiments:

  • Auto-ML Extraction
  • Online-ML Categorization
  • Auto-ML Categorization
  • Online-ML Extraction

More information about model types is available in the dedicated pages.

Problem definition

Problem definition is a step of ML categorization experiments. Its parameters are described in the Problem definition tab and are:

  • Enable strict "single label" mode

    When turned on, the model predicts at most one category for each document. When off, the model can detect any number of categories.

  • Enable strict "Sub document categorization" compatibility mode (only with Auto-ML Categorization)

    When turned on, the generated model can predict categories for sub-documents, which are portions of the input document.

    The Platform authoring application does not handle the concept of sub-document, so during an experiment the model is trained with entire documents, annotated with the expected categories, and predicts categories for entire test library documents. For this reason, when you have a sub-document categorization use case, the original documents must be broken into sub-documents using an external tool, and the sub-documents must then be imported as "normal" documents in the training and test libraries. This way you train and test the model on documents that are, in fact, chunks of larger source documents.

    Once published and inserted in a workflow, however, the model can effectively manage sub-documents provided that:

    1. The input to the model block contains the text of the whole document plus the information (boundaries) that identifies sub-documents inside it.
    2. The type of boundaries is specified as a configuration parameter of the model block.

    Under these conditions the model makes predictions for sub-documents, and each output category is accompanied by the boundaries of the portion of the input document that identifies the sub-document the category refers to.

Feature space

Feature space parameters affect ML categorization and extraction experiments and determine which features of the text are used to train the model. They are described in the Feature space tab.

In Auto ML experiments, Platform can automatically decide the features to use. This behavior is activated when the Automatic features selection option is turned on.

Available features that can be used or not are:

Feature | Description | Experiment type
Alpha Numeric words | Words consisting of both letters and digits | Extraction
Alphabetic words | Words consisting only of letters | Extraction
Collocations | Combined words, the combination has its own meaning (e.g. credit card or take a risk) | Extraction
Decimal number words | Words representing decimal numbers | Extraction
Digit words | Words consisting only of digits | Extraction
Entities | Named entities like people, places and organizations | Categorization and extraction
Knowledge Graph relations | Knowledge Graph ascending concepts, along ISA-type relationships, of the concept corresponding to text words (e.g. dentist is a medical specialist, which is a doctor, which is a professional) | Categorization and extraction
Knowledge Label | Main lemma of the concept which, inside the Knowledge Graph, in an ISA relation, is the parent of the concept corresponding to the text word (e.g. if the text word is moratorium, its parent concept's label is legal action) | Categorization
Known Concepts | Knowledge Graph concepts (syncons) for text words that are well known proper nouns (e.g. World Cup, United States) | Extraction
Logic dependencies | Syntactic relationships between text words (e.g. subject-verb-object) | Extraction
Main lemma | Base forms (lemmas) of the most important words of the text | Categorization
Main Syncons | Knowledge Graph concepts (syncons) of the most important words of the text | Categorization
Main Topics | Knowledge Graph topics the text is primarily about | Categorization
Mixed case words | Words consisting of both uppercase and lowercase letters | Extraction
Numeric words | Words that represent numbers | Extraction
Phrases | Phrases, i.e. one or more words that form a meaningful grammatical unit | Extraction
Sub-words | Parts of a word like morphemes, stems and roots | Categorization
Syncon Topics | The topics that in the Knowledge Graph are attributed to the concepts (syncon) corresponding to the text words | Categorization
Syncons | Knowledge Graph concepts (syncon) corresponding to text words | Categorization and extraction
Title case words | Capitalized words | Extraction
Upper case word | Words consisting only of uppercase letters | Extraction
Use word embeddings | Static word embeddings | Categorization
Word base form (Lemma) | Base form (lemma) of text words (e.g. run for running and ran) | Categorization and extraction
Word base form stem | Stem of text words (e.g. intern for international) | Categorization
Word form | Text word exactly as written in the text | Categorization
Word Part-of-Speech | Part-of-speech of a word (e.g. noun, verb, adjective) | Extraction

Other feature space parameters are:

  • For ML categorization and Online ML-Extraction experiments:

    • Max features: the maximum number of text features to use to train the model. Value 0 means all available features.
  • For categorization experiments:

    • Maximum N for N-grams: the N to compute N-grams for stem and keywords features. All the N-Grams up to N will be computed, thus value 1 means that only Unigrams will be computed while with 3 Unigrams, Bigrams and Trigrams will be used.
    • Min DF: the minimum number of documents in which a feature must appear to be considered for training.
    • Max DF: the maximum percentage of documents in which a feature can appear to be considered for training.
  • For Online ML-Extraction experiments:

    • Min WF: the minimum number of windows (areas around annotations) per document in which a feature must appear in order to be considered for training.
    • Max WF: the maximum percentage of windows per document in which a feature can appear to be considered for training.
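
The interplay between Maximum N for N-grams, Min DF and Max DF can be sketched as follows. This is a simplified illustration, not the Platform implementation: the function names are hypothetical, tokens are plain strings, and Max DF is expressed as a fraction between 0 and 1 rather than as a percentage.

```python
from collections import Counter


def ngrams_up_to(tokens, max_n):
    """Return all N-grams of the token list for N = 1..max_n."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def filter_by_df(docs_tokens, max_n, min_df, max_df):
    """Keep the features whose document frequency is within the limits.

    min_df: minimum number of documents containing the feature.
    max_df: maximum fraction of documents containing the feature.
    """
    df = Counter()
    for tokens in docs_tokens:
        # Count each feature at most once per document.
        df.update(set(ngrams_up_to(tokens, max_n)))
    n_docs = len(docs_tokens)
    return {f for f, c in df.items() if c >= min_df and c / n_docs <= max_df}
```

With a maximum N of 2, the tokens `credit card fee` yield the unigrams `credit`, `card`, `fee` and the bigrams `credit card`, `card fee`.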

Hyperparameters

Hyperparameters apply to ML categorization and extraction experiments. They are described in the Hyperparameters tab.

  • Alpha regularizer

    Applies to these categorization model types:

    Regularization parameter, smoothing factor on term counts. Large values increase the regularization.

  • C parameter: penalty for misclassifications

    Applies to these categorization model types:

    The C parameter is a regularization (or generalization) parameter for training data.
    It is a number and it is used to prevent over-fitting and under-fitting.
    If you have a big training set and you consider it representative, the parameter value should be small to increase the regularization and force the model to be tailored on training data. On the contrary, if you have a small training set, which may not be representative, it is better to select a large value to avoid the possible errors related to a strong regularization.

  • Class weight

    Applies to these categorization model types:

    Regularization parameter to balance categories. Possible values are:

    • Balanced
    • None

    If one category is preponderant over all the others in the training set, balancing categories prevents unbalanced predictions for less represented classes. If the training set is highly representative, balancing slightly reduces model performance. If, on the other hand, the training set is not very representative and balancing is not enabled, model performance is poor.

  • CRF c1 regularization coefficient

    Applies to the CRF extraction model type.

    Coefficient for L1 (alpha) regularization. If set to a value greater than 0, it enables the Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) method.

  • CRF c2 regularization coefficient

    Applies to the CRF extraction model type.

    Regularization coefficient. If a value is set, it enables L2 (lambda) regularization in the l2sgd and lbfgs training algorithms.

  • CRF Forced use of all possible states

    Applies to the CRF extraction model type.
    When enabled, the algorithm generates state features for all the combinations of attributes and labels, including those that don't occur in the training data (negative state features). This may improve labeling accuracy but slows down the training process.

  • CRF Forced use of all possible transitions

    Applies to the CRF extraction model type.
    When enabled, the algorithm generates transition features for all the possible pairs of labels, even if they don't occur in training data (negative transition features).

  • Custom kernel type to be applied

    Applies to the Custom kernel SVM categorization model type.
    It's the kernel function to use to represent features.

  • Degree of polynomial for polynomial kernel

    Applies to the Custom Kernel SVM categorization model type.
    Affects the polynomial custom kernel. It's the degree of the polynomial function.

  • Fit batch size

    Applies to online training extraction model types.
    It's the size of the batches in which the training set gets divided.

  • Inverse of regularization strength

    Applies to these categorization model types:

    Penalty for misclassification. It's a regularization parameter for training data. It is a number and it is used to prevent over-fitting and under-fitting. If you have a big training set and you consider it representative, the value should be small to increase the regularization and force the model to be tailored on training data. On the contrary, if you have a small dataset, which may not be representative, it is better to select large values to avoid errors.

  • L1 regularization term on weights

    Applies to the XGBoost categorization model type.

    Regularization alpha parameter. Lasso regularization parameter, used to reduce complexity and prevent over-fitting. When 0, no regularization is applied. The cost function is altered by adding a penalty term equal to the magnitude of the coefficients.

  • L2 regularization term on weights

    Applies to the XGBoost categorization model type.

    Regularization lambda parameter. Ridge regularization parameter, used to reduce complexity and prevent over-fitting. The cost function is altered by adding a penalty term equals to square magnitude of the coefficients. The penalty term lambda regularizes large coefficients by penalizing the cost function. When 0 no regularization is applied.

  • Learning rate

    Applies to these categorization model types:

    The rate of adaptation of the model to learning, that is how quickly the error tends to decrease.
    The learning rate scales the contribution of each decision tree. Higher values mean faster training but possibly less effective models.

  • Left window size, CRF left window size

    Applies to these extraction model types:

    Number of tokens to the left of an annotation that the algorithm takes into account.

  • Max number of epochs

    In online training experiments, it's the maximum number of training epochs, that is the number of times the ML algorithm passes through the entire training set.
    Training can stop before this number of iterations is reached, based on the value of the Patience parameter.

  • N. of iterations with no changes

    Applies to the GBoost categorization model type.

    Parameter used as early stopping criterion. During training, if the score hasn't improved since the last iteration, the training stops. Value -1 means no improvement.

  • Normalize: penalize long documents to avoid their dominance in stats

    Applies to the Complement Naive Bayes categorization model type.
    When turned on, long documents are discarded to balance statistics.

  • Number of trees, Number of decision trees

    Apply to these categorization model types:

    Number of decision trees to generate to make a decision. This value affects regularization. A high value implies low regularization (or generalization) and possible over-fitting. On the other hand, a low value implies a high level of generalization, with the risk of under-fitting and of not predicting some classes.

  • Optimization problem algorithm

    Applies to the Logistic Regression categorization model type.
    It's the algorithm to use for the optimization problem.

  • Patience

    Applies to online training experiments.
    It's the maximum number of epochs without improvement that are tolerated before stopping iterations.

  • Right window size, CRF right window size

    Applies to these extraction model types:

    Number of tokens to the right of an annotation that the algorithm takes into account.

  • SGD alpha regularization parameter

    Applies to the SGD categorization model type.
    Regularization parameter for training data. It's a number and it is used to prevent over-fitting and under-fitting. Larger values set a stronger regularization.

  • Split criterion on tree nodes

    Applies to the Random Forest categorization model type.

    The split criterion to be applied on each node of the tree, that is the criterion to choose the branch to follow in the decision tree.

  • Stop condition tolerance, Tolerance for early stopping

    Apply to these categorization model types:

    It's a number indicating how much error is tolerated before early stopping. Larger values mean fewer iterations.
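
The relationship between Max number of epochs and Patience can be illustrated with a minimal early-stopping loop. This is a sketch under assumptions: the function name is hypothetical, a higher score is taken to be better, and training is assumed to stop as soon as the number of consecutive epochs without improvement reaches the patience value. The actual stopping rule used by Platform may differ.

```python
def train_with_patience(epoch_scores, max_epochs, patience):
    """Simulate training on precomputed per-epoch validation scores.

    Returns (epochs_run, best_score).
    """
    best = float("-inf")
    without_improvement = 0
    epochs_run = 0
    for score in epoch_scores[:max_epochs]:
        epochs_run += 1
        if score > best:
            best = score
            without_improvement = 0
        else:
            without_improvement += 1
            if without_improvement >= patience:
                break  # patience exhausted: stop before max_epochs
    return epochs_run, best
```

In this sketch, a small patience saves training time at the risk of stopping before a late improvement, while Max number of epochs remains a hard upper bound.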

Generic parameters

Generic parameters apply to explainable categorization experiments in the Generic parameters tab.

  • Enable "onCategorizer" optimization

    When enabled, categorization fine tuning is performed.

  • Enable "strict" hierarchical mode

    When enabled, all ascending categories in the hierarchy are returned together with the predicted category.
    For example, if the taxonomy models animals and the predicted category is cat then the entire hierarchical path cat > feline > mammals > vertebrates > animals is returned.

  • Enable "single label" mode

    If enabled, the model predicts at most one category.
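
The effect of the "strict" hierarchical mode can be sketched by walking up a taxonomy from the predicted category. The `parent_of` mapping and the function name are hypothetical, and the taxonomy is assumed to be a tree without cycles.

```python
def hierarchical_path(category, parent_of):
    """Return the category followed by all its ancestors, nearest first."""
    path = [category]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return path


# Taxonomy from the example above, expressed as child -> parent.
parent_of = {
    "cat": "feline",
    "feline": "mammals",
    "mammals": "vertebrates",
    "vertebrates": "animals",
}
print(" > ".join(hierarchical_path("cat", parent_of)))
```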

Rules Generation

Rules generation parameters are described in the Rules Generation tab and they apply to these types of experiment:

  • Explainable categorization
  • Bootstrapped Studio project (categorization)
  • Explainable extraction
  • Thesaurus generation

Categorization experiments

  • Enable generation of syncon based rules

    When enabled, generated rules can use the SYNCON attribute. If disabled, the Enable generation of ancestor based rules parameter is also disabled.

  • Enable generation of ancestor based rules

    When enabled, generated rules can use the ANCESTOR attribute. This parameter is disabled if the Enable generation of syncon based rules parameter is disabled.

  • Max number of items in each rule

    Maximum number of operands in rules' conditions.

  • Max number of rules for each taxonomy category

    Maximum number of rules that can be generated for each category.

  • Min number of annotated documents for a category, to enable rules generation

    Minimum number of documents in which a category has been annotated that is required to generate rules for that category.

  • Max number of rules in which any single item can participate

    Maximum number of rules in which a text feature (e.g. a concept, a lemma, an exact word) can be used. It's used to limit the excessive generation of rules.

  • Max number of elements in a single item of a rule

    Maximum number of attributes that can be used in an operand of a rule condition.

Extraction experiments

  • Maximum number of conditions for any given rule

    Maximum number of conditions to use in a rule.

  • Enable automatic minimum support setup

    When turned on, Platform will automatically determine the minimum support, that is the minimum number of times a rule must match inside the training set to be included in the model.
    This parameter is an alternative to Custom minimum support threshold (see below).

  • Custom minimum support threshold

    Manually entered alternative to Enable automatic minimum support setup (see above).

  • Enable automatic minimum confidence setup

    When turned on, Platform will automatically determine the minimum confidence, that is the number of times a rule must match in the class target context to be included in the model.
    This parameter is an alternative to Custom minimum confidence to explore a rule (see below).

  • Custom minimum confidence to explore a rule

    Manually entered alternative to Enable automatic minimum confidence setup (see above).

  • Minimum acceptance confidence threshold

    Minimum threshold to determine that a rule is acceptable. Smaller values mean greater acceptance and imply a greater final project recall.

  • Minimum confidence improvement for adding a new condition to a rule

    Minimum improvement of rule's confidence that an additional condition must bring in order to be included in the rule.

  • Enable concatenation of contiguous extractions

    When turned on, contiguous extractions—composed of multiple adjacent tokens—are concatenated.
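
As a rough illustration of how support and confidence thresholds filter candidate rules, the sketch below treats support as the number of times a rule matches in the training set and confidence as the fraction of those matches that fall on the target class. The ratio-based definition of confidence and the function names are assumptions borrowed from common rule-mining practice; Platform may compute these values differently.

```python
def rule_stats(matches):
    """matches: one boolean per rule match in the training set,
    True when the match falls on the target class.

    Returns (support, confidence).
    """
    support = len(matches)
    confidence = sum(matches) / support if support else 0.0
    return support, confidence


def keep_rule(matches, min_support, min_confidence):
    """Accept a rule only if it clears both thresholds."""
    support, confidence = rule_stats(matches)
    return support >= min_support and confidence >= min_confidence
```

Under this reading, lowering the confidence threshold accepts more rules, which is consistent with the note above that smaller values imply greater recall.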

Thesaurus experiments

  • Template name

    Template name for output records.

  • Field name

    Name of the field where the concepts are extracted.

  • File/batch granularity

    Maximum number of concepts whose extraction rules are placed in a single rule file.

  • Keep longest match

    When enabled, if different concepts are extracted from partially overlapping portions of text, only the concept corresponding to the longest portion is extracted.
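
The Keep longest match behavior can be sketched as a greedy selection over overlapping text spans. The tuple layout `(start, end, concept)` with an exclusive `end` and the function name are assumptions made for this example.

```python
def keep_longest_matches(matches):
    """Among overlapping matches keep only the longest one.

    matches: list of (start, end, concept) tuples, end exclusive.
    """
    kept = []
    # Visit longer matches first, so that they win over shorter overlapping ones.
    for m in sorted(matches, key=lambda m: (-(m[1] - m[0]), m[0])):
        if all(m[1] <= k[0] or m[0] >= k[1] for k in kept):
            kept.append(m)
    return sorted(kept)
```

For example, if both `credit` and `credit card` match at the same position, only `credit card` survives, while a non-overlapping match elsewhere in the text is unaffected.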

Feature options

Feature options apply to explainable extraction experiments and are described in the Feature options tab.

  • Window size (in tokens) to the left of the token being predicted

    A window is a segment (set of tokens) used to find regularities in the text in order to generate rules. This parameter specifies the number of tokens to consider to the left of the predicted token, that is the preceding tokens.

  • Window size (in tokens) to the right of the token being predicted

    This parameter specifies the number of tokens to consider to the right of the predicted token, that is the subsequent tokens.

  • Minimum document frequency

    Minimum number of documents in which a feature (e.g. a concept, a lemma, an exact word) must be present in order to include it in a rule.

  • Raw word form

    When enabled, exact words can be used in rules to match text words.

  • Word base form (Lemma)

    When enabled, base forms (lemmas) can be used in rules to match the corresponding attribute of text words.

  • Word Part-of-Speech

    When enabled, part-of-speech (for example noun, verb, etc.) can be used in rules to match the corresponding attribute of text words.

  • Syncons

    When enabled, Knowledge Graph concepts (syncons) can be used in rules to match the concept expressed by text words.

  • Ancestors

    When enabled, Knowledge Graph concepts (syncons) that, in an ISA hierarchy, are the ascendant of a given concept can be used in rules to match concepts expressed by text words.

  • Numeric words

    When enabled, numeric words (like 500) can be used in rules to match text words.

  • Use suffix of a word

    When enabled, word suffixes can be used in rules to match text word suffixes.

  • Use prefix of a word

    When enabled, word prefixes (also called stems or roots) can be used in rules to match text word prefixes.
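
The left and right window sizes can be pictured as slices of the token list around the token being predicted. The function below is a minimal sketch with a hypothetical name. It only shows which tokens fall inside the window, not how rules are derived from them.

```python
def context_window(tokens, index, left, right):
    """Return (left_context, right_context) for the token at `index`."""
    start = max(0, index - left)
    left_context = tokens[start:index]
    right_context = tokens[index + 1:index + 1 + right]
    return left_context, right_context
```

At the beginning or end of the text the window is simply truncated, so the contexts can be shorter than the configured sizes.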

Fine tuning

Explainable categorization and bootstrapped Studio project experiments create models that can use JavaScript to extend and control the document analysis pipeline.
Fine tuning is performed in the Fine tuning tab with the onCategorizer event handling function, which is automatically invoked after categorization rules have been evaluated.

The Fine tuning tab is available in both explainable categorization and bootstrapped Studio project experiments. In the first case, fine tuning can be configured only if the Enable "onCategorizer" optimization parameter in the Generic parameters step of the experiment wizard is turned on.

  • Desired Clean level

    Categorization results clean-up is performed with the CLEAN function.
    The value of this parameter affects the value argument of that function: if set to auto in explainable categorization experiments, the fine tuning algorithm iteratively guesses the best value to use starting from the value of the Default clean level parameter (see below). This parameter is also available in bootstrapped Studio projects.

  • Default clean level

    Doesn't apply to bootstrapped Studio project experiments.
    Initial value for the value argument of the CLEAN function when Desired clean level (see above) is set to auto.

    Note

    When the Enable "single label" mode parameter in the Generic parameters step of the experiment wizard is turned on, this parameter is ignored.
    Also, if Enable conservative clean (see below) is enabled, cleanup is skipped.

  • Desired filter sequence

    The value to use for the filters argument of the FILTER function. It can be set to auto in explainable categorization experiments which means that the fine tuning algorithm iteratively guesses the best value starting from the value of the Default filter sequence parameter (see below). This parameter is also available in bootstrapped Studio projects.

  • Default filter sequence

    Doesn't apply to bootstrapped Studio project experiments.
    Initial value for the filters argument of the FILTER function when Desired filter sequence (see above) is set to auto.

    Note

    When the Enable "single label" mode parameter in the Generic parameters step of the experiment wizard is turned on, this parameter is ignored and value 100 is used to keep only the category with the highest score.

  • Enable conservative clean

    Doesn't apply to bootstrapped Studio project experiments.
    If no category exceeds the clean level (see above), cleanup is not performed.

  • Max number of documents to be considered by the optimization algorithm

    Doesn't apply to bootstrapped Studio project experiments.
    Maximum number of documents used in the fine tuning process. Value -1 means no limit.

Rules selection

Rules selection parameters apply to explainable extraction experiments and they are described in the Rules selection tab. They affect the fine tuning of generated rules.
Rules are fine tuned by validating them against a subset of the training set, ranking them and selecting those with the highest scores.

  • Fine-tuning rules selecting only the most significant ones

    When turned on, the rule validation and selection step is performed and all the other fine tuning parameters (see below) can be set.

  • Number of rules selection steps

    Number of iterations in the rules validation and selection step.

  • Fraction of validation split

    Percentage of the training set that is used by the validation and selection step.

  • Activate rules pruning

    When enabled, the number of selected rules to keep and include in the model can be set with the Max number of rules to select parameter (see below).

  • Max number of rules to select

    Maximum number of rules to keep after validation and selection. Rules are counted by scrolling through the list of selected rules in descending score order. This parameter can be set only if Activate rules pruning is enabled (see above).
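
Rules pruning, as described above, amounts to ranking the validated rules by score and keeping the first ones. A minimal sketch, assuming each rule is represented as a hypothetical `(name, score)` pair:

```python
def select_rules(scored_rules, max_rules=None):
    """Rank rules by descending score and optionally keep only the top ones.

    scored_rules: list of (rule_name, score) pairs.
    max_rules: None when rules pruning is not activated.
    """
    ranked = sorted(scored_rules, key=lambda r: r[1], reverse=True)
    return ranked if max_rules is None else ranked[:max_rules]
```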

F-Beta

F-Beta parameters apply to all experiments except:

  • Explainable extraction
  • Bootstrapped Studio project
  • Thesaurus generation

They are described in the F-Beta tab.

F-Beta is a more general way of computing the F-score. F-Beta parameters affect the balance between precision and recall when computing the F-Measure at the end of the test phase of the experiment.

F-Beta parameters are:

  • Enable F-Beta optimization (tuning balance between precision and recall): when turned on, it is possible to set the Target F-Beta parameter.
  • Target F-Beta: value 1 gives the same weight to precision and recall, values lower than 1 give more weight to precision while values greater than 1 give more weight to recall.
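
For reference, the standard F-Beta formula is (1 + beta²) · P · R / (beta² · P + R), where P is precision and R is recall. The short function below implements it; the function name is only illustrative.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-Beta score; beta=1 gives the usual F1 score."""
    b2 = beta ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator
```

With precision 0.8 and recall 0.4, a beta of 0.5 yields a higher score than a beta of 2, because values lower than 1 reward the stronger precision.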

Auto ML parameters

These parameters apply to Auto-ML experiments when you turn on Automatic features selection in the Feature space tab or Activate Auto-ML on every parameter in the Hyperparameters tab. Such parameters are described in the Auto ML parameters tab.
In that case, Platform trains an ML model that it then uses to predict the best features and the best hyperparameter values to use when actually training the experiment model.

This assistant model is trained iteratively by passing through its training data multiple times. Its parameters are:

  • Number of training iterations for the AutoML algorithm: maximum number of self-tuning iterations.
  • Number of data splits for cross-validation of AutoML algorithm: number of subdivisions of training data.
  • Call back function for stopping the AutoML self-tuning process: early iteration termination policy. The stop can occur when a high score is reached, when a time limit is exceeded or when a combination of a good score and elapsed time occurs.
  • Target time deadline for the AutoML call back stop function (minutes): time limit beyond which the self-tuning algorithm stops iterating if a time-based early termination policy has been chosen (see the parameter above).
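
The early termination policies listed above can be sketched as a single stop function. Everything in this example is hypothetical: the policy identifiers, the default thresholds and the reading of the combined policy as "good score and deadline both reached" are assumptions, not the actual Platform callback.

```python
def should_stop(best_score, elapsed_minutes, policy,
                target_score=0.95, deadline_minutes=60.0):
    """Decide whether the self-tuning loop should stop iterating."""
    score_reached = best_score >= target_score
    time_exceeded = elapsed_minutes >= deadline_minutes
    if policy == "score":
        return score_reached
    if policy == "time":
        return time_exceeded
    if policy == "score_and_time":
        return score_reached and time_exceeded
    raise ValueError(f"unknown policy: {policy}")
```

Regardless of the policy, the Number of training iterations parameter still caps the total number of iterations.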

Layout information

The Analysis strategy for documents with layout information parameter applies to Studio experiments.

It determines how to manage graphical layout information that can be present in test documents. Test documents can have layout information if they originate from PDF files that were imported with the PDF document view option enabled.

Possible values are:

  • Require layout information: only documents with layout information are analyzed.
  • Rely on expert.ai Extract layout information where available: all documents are analyzed. The model will leverage layout information—when present—if it has been programmed to do so.
  • Plain text: all documents are analyzed and layout information is ignored, only the plain text is fed to the model.

Summary

Use the Summary tab to set the matching strategy parameter, which applies to extraction and thesaurus experiments, and to review all the parameters you set in the previous steps.

The value of the parameter (Strict, Ignore value or Ignore position) determines the strategy used to compute experiment metrics.

Note

The Summary tab is available for all experiments.

General

These parameters apply to thesaurus generation experiments and are described in the General tab.

  • Generate labels rules

    When turned on, labels and context terms are considered when generating concept extraction rules.

  • Generate knowledge sources rules

    When turned on, labels deriving from linked knowledge sources are considered when generating concept extraction rules.

  • Generate linked projects rules

    When turned on, labels deriving from other projects—of which you have visibility—are considered when generating concept extraction rules.

  • Generate advanced rules

    When turned on, advanced rules are considered when generating concept extraction rules.

  • Generate kill lists rules

    When turned on, the project kill lists are used to prevent concept extractions.

Scoring

These parameters apply to thesaurus generation experiments and are described in the Scoring tab.
Thesaurus projects based on NL Core 4.9 or later generate explainable models with extraction rules that attribute a confidence score to extracted concepts. Rules' score can be overridden by a JavaScript scoring algorithm, embedded in the model, which behaves according to the parameters listed below.

Note

In an NL Flow workflow, the values of the parameters can be changed by setting specific options in the input JSON.

Below Scoring type, the following options are available:

  • No scoring: when turned on, the score is not calculated, no tab parameters are available and the Post-processing tab is grayed out.
  • Thesaurus based (labels and relations): when turned on, the generated model executes the JavaScript algorithm that overrides the confidence scores attributed by extraction rules.
  • Document based (positions and frequencies): when turned on, the score is based on the frequency and the position of the mentions of the concepts in the text (see formula below before its parameters).
  • Documents, thesaurus and matches based: when turned on, the score is calculated with the TF-IDF algorithm. If selected, the inverse document frequencies for terms need to be provided in IDF values in the form of an object. For example:
{"investment":5.2,"interest":8.7,"stock":3.4,"dividend":6.1,"portfolio":9.3,"asset allocation":2.9,"equity":7.5,"capital gains":4.6,"bond":8.2,"liquidity":2.3,"mutual fund":6.7,"market value":5.8,"fixed income":3.1,"risk management":9.6,"hedge fund":7.3,"credit rating":4.9,"financial planner":6.5,"pension":3.8,"retirement account":8.9,"401(k)":5.6,"debt":2.7,"budget":9.2,"savings account":4.3,"tax deduction":7.9,"inflation":3.6,"insurance":9.1,"credit score":2.4,"real estate":8.4,"net worth":6.3,"cash flow":5.5,"economic indicators":4.7,"asset management":7.7,"leverage":3.9,"dollar cost averaging":8.6,"compound interest":6.9,"credit card":2.8,"recession":9.7,"solvency":5.9,"taxable income":4.2,"bankruptcy":7.2 ...}
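
To show how an IDF object like the one above could be used, here is a minimal TF-IDF sketch that multiplies the number of times each term is extracted from a document by its IDF value. The function name, the fallback `default_idf` and the plain product are assumptions made for this example. The scoring actually performed by the model may include further normalization.

```python
from collections import Counter


def tfidf_scores(extracted_terms, idf_values, default_idf=1.0):
    """Score each extracted term as term frequency times IDF.

    extracted_terms: one entry per mention found in the document.
    idf_values: mapping term -> inverse document frequency.
    """
    tf = Counter(extracted_terms)
    return {term: count * idf_values.get(term, default_idf)
            for term, count in tf.items()}
```

Terms that are rare across the collection (high IDF) score higher than common ones mentioned the same number of times.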

If Thesaurus based (labels and relations) is selected, the following parameters are available:

  • Frequency boosting: enable

    When turned on, the base score for all extractions is the concept frequency in the text. This setting is an alternative to Default score value.

  • Default score value

    Default base score for all extractions. Ignored if Frequency boosting: enable is turned on. Value 0 means no confidence score is attributed to extracted concepts.

  • Normalization Value

    The final score is normalized to a value in the range between 0 and the value of this parameter. Use value 0 to disable normalization.

  • Hierarchy boosting: by parent

    Multiplication factor applied to the base score based on the relationship between the extracted concept and other extracted concepts. This factor is applied for every broader concept that is also extracted from the text. Value 0 is interpreted as no multiplication.

  • Hierarchy boosting: by children

    Multiplication factor applied to the base score based on the relationship between the extracted concept and other extracted concepts. This factor is applied for every narrower concept that is also extracted from the text. Value 0 is interpreted as no multiplication.

  • Hierarchy boosting: by related

    Multiplication factor applied to the base score based on the relationship between the extracted concept and other extracted concepts. This factor is applied for every non-hierarchically related concept that is also extracted from the text. Value 0 is interpreted as no multiplication.

  • Label boosting: preferred label matched

    Multiplication factor applied to the base score if the matching text is the preferred label. Affected by Label boosting: ignore case.

  • Label boosting: alternative label matched

    Multiplication factor applied to the base score if the matching text is one of the alternative labels. Affected by Label boosting: ignore case.

  • Label boosting: length measure

    Multiplication factor applied to the base score, corresponding to the number of tokens (separated by spaces) in the match. Affected by Label boosting: ignore case.

  • Label boosting: ignore case

    When turned on, case is ignored when matching the text against the labels of the concept. This setting affects:

    • Label boosting: preferred label matched
    • Label boosting: alternative label matched
    • Label boosting: length measure
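
To make the interaction of these parameters more concrete, here is a minimal sketch of how the base score and the boosts could be combined. It is not the JavaScript algorithm embedded in the model: the order of operations, the dictionary keys and the function name `thesaurus_score` are assumptions made for illustration. Normalization and Label boosting: length measure are left out because the text above does not specify how they are computed.

```python
def thesaurus_score(match_text, concept, extracted_ids, frequency, params):
    """Combine the base score and the boosts for one extracted concept (illustrative)."""
    # Base score: concept frequency if frequency boosting is on, else the default value.
    score = frequency if params["frequency_boosting"] else params["default_score"]

    def boost(factor, times=1):
        # A factor of 0 is interpreted as "no multiplication".
        return 1.0 if factor == 0 else factor ** times

    # Hierarchy boosting: the factor is applied once for every broader, narrower
    # or related concept that was also extracted from the text.
    relations = {"broader": "by_parent", "narrower": "by_children", "related": "by_related"}
    for relation, key in relations.items():
        hits = sum(1 for cid in concept[relation] if cid in extracted_ids)
        score *= boost(params[key], hits)

    # Label boosting, optionally case-insensitive.
    # Assumption: the "0 means no multiplication" rule is reused here.
    norm = str.lower if params["ignore_case"] else (lambda s: s)
    if norm(match_text) == norm(concept["pref_label"]):
        score *= boost(params["label_preferred"])
    elif norm(match_text) in {norm(alt) for alt in concept["alt_labels"]}:
        score *= boost(params["label_alternative"])
    return score

# Hypothetical concept and parameter values.
concept = {
    "pref_label": "Hedge fund",
    "alt_labels": ["hedge funds"],
    "broader": ["fund"],
    "narrower": [],
    "related": ["risk management", "leverage"],
}
params = {
    "frequency_boosting": False, "default_score": 1.0,
    "by_parent": 2.0, "by_children": 0, "by_related": 1.5,
    "label_preferred": 3.0, "label_alternative": 1.2, "ignore_case": True,
}
extracted = {"fund", "leverage"}
score_default = thesaurus_score("hedge fund", concept, extracted, 4, params)
score_case = thesaurus_score("hedge fund", concept, extracted, 4, dict(params, ignore_case=False))
score_freq = thesaurus_score("hedge fund", concept, extracted, 4, dict(params, frequency_boosting=True))
```

In this sketch, turning off Label boosting: ignore case removes the preferred-label boost for the lowercase match, and turning on Frequency boosting: enable replaces the default base score with the frequency.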

If Document based (positions and frequencies) is selected, the formula to calculate the score is:

where:

  • posc is the zero-based start position in the text of the extraction.
  • lt is the length of the text in characters.
  • posB is a parameter (see below).
  • k is a parameter (see below).
  • ec is the number of extractions for the concept.
  • emax is the number of extractions of the concept that has the most extractions.
  • b is a parameter (see below).
  • et is the total number of extractions.
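
The document-level quantities in this list can be derived from the text and the list of extractions. The sketch below shows one way to compute them; it does not evaluate the score formula itself. The input shape (pairs of concept identifier and start offset) and the function name `document_score_inputs` are hypothetical, and the keys `pos_c`, `l_t`, `e_c`, `e_max` and `e_t` stand for posc, lt, ec, emax and et.

```python
from collections import Counter

def document_score_inputs(text, extractions):
    """Return one dict of formula inputs per extraction.

    extractions: list of (concept_id, start_offset) pairs with zero-based offsets.
    """
    l_t = len(text)                              # lt: text length in characters
    counts = Counter(cid for cid, _ in extractions)
    e_t = len(extractions)                       # et: total number of extractions
    e_max = max(counts.values(), default=0)      # emax: count of the most extracted concept
    return [
        {"concept": cid, "pos_c": start, "l_t": l_t,
         "e_c": counts[cid], "e_max": e_max, "e_t": e_t}
        for cid, start in extractions
    ]

text = "Bond yields rose. Stock prices fell while bond demand grew."
extractions = [
    ("bond", 0),
    ("stock", text.find("Stock")),
    ("bond", text.find("bond")),   # the lowercase mention later in the text
]
rows = document_score_inputs(text, extractions)
for row in rows:
    print(row)
```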

If this option is selected, the following parameters are available:

  • K

    Positive tuning parameter to normalize the document term frequency. Value between 0 and 1 (0.5 by default), where 0 means pure relative frequency.

  • B

    Positive tuning parameter that determines the scaling by document length. Value between 0 and 1 (0.5 by default), where 1 corresponds to fully scaling the term weight by the document length, and 0 corresponds to no length normalization.

  • avgE

    Average number of extractions per document in the document set. Number greater than 0, 1 by default.

  • posB

    Position bias. It boosts the score if the extraction appears in a specific part of the document. Number between 0 and 1, 0.25 by default.

Post-processing

These parameters apply to thesaurus generation experiments and are available in the Post-processing tab. If No scoring is selected below Scoring type in the Scoring tab, the Post-processing tab is grayed out.

  • Score threshold

    Score threshold under which concepts are discarded. Number greater than or equal to 0; the default value 0 disables the parameter.

  • Narrower score thresholds

    Deletes broader concepts when narrower ones score better than the given threshold. Number greater than or equal to 0; the default value 0 disables the parameter.

  • Boost results in specific sections

    Multiplies the score of concepts extracted within project sections. Specify the multipliers in the following format:

    sectionName1=scoreMultiplier1[, sectionName2=scoreMultiplier2, sectionName#=scoreMultiplier#]

    where:

    • sectionName# is the section name.
    • scoreMultiplier# is the section score multiplier.

    The field initially shows default values. Your project must have defined sections for the score boost to take effect.
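
The three post-processing parameters can be pictured with the following minimal sketch, which parses a section boost string in the format above and then applies the two thresholds. The order of the steps, the concept dictionary keys (`id`, `score`, `section`, `broader`), the section names and the function names are assumptions for illustration, not the product's implementation.

```python
def parse_section_boosts(spec):
    """Parse 'name1=mult1, name2=mult2' into {name: multiplier}."""
    boosts = {}
    for item in spec.split(","):
        if not item.strip():
            continue
        name, _, multiplier = item.partition("=")
        boosts[name.strip()] = float(multiplier)
    return boosts

def post_process(concepts, boosts, score_threshold=0.0, narrower_threshold=0.0):
    """Apply section boosts, then the score and narrower score thresholds."""
    # 1. Multiply the score of concepts extracted inside a listed section.
    scored = [dict(c, score=c["score"] * boosts.get(c.get("section"), 1.0))
              for c in concepts]
    # 2. Discard concepts scoring under the threshold (0 disables the step).
    if score_threshold > 0:
        scored = [c for c in scored if c["score"] >= score_threshold]
    # 3. Delete broader concepts when a narrower concept scores better than
    #    the narrower threshold (0 disables the step).
    if narrower_threshold > 0:
        dropped = {b for c in scored if c["score"] > narrower_threshold
                   for b in c["broader"]}
        scored = [c for c in scored if c["id"] not in dropped]
    return scored

# Hypothetical section names and concepts.
boosts = parse_section_boosts("TITLE=2, ABSTRACT=1.5")
concepts = [
    {"id": "fund", "score": 4.0, "section": "BODY", "broader": []},
    {"id": "hedge fund", "score": 3.0, "section": "TITLE", "broader": ["fund"]},
    {"id": "leverage", "score": 0.5, "section": "BODY", "broader": []},
]
kept = post_process(concepts, boosts, score_threshold=1.0, narrower_threshold=5.0)
print(kept)
```

With these sample values, "hedge fund" is boosted by the TITLE multiplier, "leverage" falls under the score threshold, and "fund" is deleted because its narrower concept scores better than the narrower threshold.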