Experiment parameters

The parameters that can be set in the categorization experiment and the extraction experiment wizards are described below.

Categorization experiments

Auto-ML Categorization

Problem definition

Parameter Description
Enable strict "single label" mode When turned on, the model always predicts one (and only one) category for each document. When off, the model can predict any number of categories (multi-label predictions) or no category.
Default: off.
Enable strict "Sub document categorization" compatibility mode When turned on, the model can predict categories over portions of the document (sub document categorization).
Default: off.
Annotation and testing for sub document categorization must be performed with external tools. Contact expert.ai technical support for more information.

Warning

If the Enable strict "single label" mode parameter is turned on, F-Beta parameters will not be available.
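
The difference between the two prediction modes can be sketched as follows. This is a minimal illustration, not the Platform's implementation: the `scores` dictionary, the `threshold` value and the function names are assumptions made for the example.

```python
def predict_single_label(scores):
    """Strict single-label mode: always return exactly one category."""
    return [max(scores, key=scores.get)]


def predict_multi_label(scores, threshold=0.5):
    """Multi-label mode: return every category at or above the threshold (possibly none)."""
    return [cat for cat, score in scores.items() if score >= threshold]


# Hypothetical per-category scores for one document.
scores = {"sports": 0.2, "finance": 0.7, "politics": 0.6}
print(predict_single_label(scores))      # ['finance']
print(predict_multi_label(scores))       # ['finance', 'politics']
print(predict_multi_label(scores, 0.9))  # []
```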

Feature space

The main parameter for this step of the experiment wizard is Automatic features selection, which is turned on by default. This means that Platform will automatically determine the document features to use to train the model.
If turned off, the following parameters can be set to either Use or Don't use:

Parameter Description of the corresponding feature Default
Word form Word exactly as written in the text Use
Word base form (Lemma) Base form of a word i.e. its lemma, for example run for words like running or ran Use
Main lemma Document-level most representative lemmas Don't use
Word base form stem Stem of a word, for example intern for international Use
Sub-words Parts of a word like morphemes, stems and endings, roots Use
Entities Named entities derived from the text, like people, places and organizations Use
Syncons Meaning of words determined by disambiguation, for example "to work out" meaning "to exercise" Use
Main Syncons Document-level most representative syncons Use
Syncon Topics Generalized main subjects being discussed (e.g. "mammal" as a concept in "the tiger is a mammal" has topic "zoology") Use
Main Topics Document-level most representative topics Use
Knowledge Label Pre-defined parent syncon (e.g. "legal action" is the knowledge label for "moratorium") Don't use
Knowledge Graph relations Attribute the hierarchical relation nodes as added meaning to the word (e.g. "dentist" is also "medical specialist", "doctor", "professional", etc.) Use
Use word embeddings Static word embeddings Don't use

Hyperparameters

Hyperparameters are model specific. Here follows the list.

Linear SVM and Probabilistic SVM
Parameter Description
SVM C parameter: penalty for misclassifications The C parameter is a regularization (or generalization) parameter on training data. It is used to prevent overfitting and underfitting. If you have a big dataset and you consider it representative, the regularization parameter value should be small to increase the regularization and force the model to be tailored on training data. On the contrary if you have a small dataset, which may not be representative, it is better to select a large value to avoid the possible errors related to a strong regularization.
Possible values:
- 0.001
- 0.01
- 0.05
- 0.1
- 0.2
- 0.3
- 0.4
- 0.5
- 0.6
- 0.7
- 0.8
- 0.9
- 1 (by default)
- 5
Class weight Regularization parameter on the class balancing. For example, in a training set where one class is preponderant over all the others, selecting the value Balanced forces a re-weighting to prevent unbalanced predictions for less represented classes. If the training set is highly representative, choosing Balanced makes the model perform slightly worse than it would with the None value. If, on the other hand, the training set is not very representative and None is selected, the result is a very poor model.
Possible values:
- Balanced (by default)
- None
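
As an illustration of what a Balanced re-weighting can look like, the sketch below uses the heuristic popularized by scikit-learn, `n_samples / (n_classes * class_count)`. Whether the Platform uses exactly this formula is an assumption; the function name and the sample labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency (assumed 'Balanced' heuristic)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical training set: 8 documents of one class, 2 of another.
labels = ["invoice"] * 8 + ["contract"] * 2
weights = balanced_class_weights(labels)
print(weights)  # {'invoice': 0.625, 'contract': 2.5}
```

The under-represented class receives the larger weight, so its errors count more during training.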
Custom Kernel SVM
Parameter Description
Custom kernel type to be applied The kernel function to select to represent the feature.
Possible values:
- Polynomial kernel (by default)
- Sigmoid kernel
Inverse of regularization strength Penalty for misclassification. Regularization parameter on training data. It is used to prevent overfitting and underfitting. If you have a big dataset and you consider it representative, the regularization parameter value should be small to increase the regularization and force the model to be tailored on training data. On the contrary, if you have a small dataset, which may not be representative, it is better to select a large value to avoid errors.
Possible values:
- 0.001
- 0.01
- 0.05
- 0.1
- 0.2
- 0.3
- 0.4
- 0.5
- 0.6
- 0.7
- 0.8
- 0.9
- 1 (by default)
- 5
Degree of polynomial for polynomial kernel Available if the parameter Custom kernel type to be applied is set to Polynomial kernel. The parameter sets the degree of the polynomial kernel function.
Possible values: number between 2 and 7. (3 by default)
Stop condition tolerance Minimum amount by which the error must keep decreasing; when the improvement falls below this value, the algorithm stops iterating earlier than planned (early stopping). Large values mean fewer iterations.
Possible values: number between 0.000001 and 0.1. (0.0001 by default)
Class weight Regularization parameter on the class balancing. For example, in a training set where one class is preponderant over all the others, selecting the value Balanced forces a re-weighting to prevent unbalanced predictions for less represented classes. If the training set is highly representative, choosing Balanced makes the model perform slightly worse than it would with the None value. If, on the other hand, the training set is not very representative and None is selected, the result is a very poor model.
Possible values:
- Balanced (by default)
- None
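
For reference, the two kernel functions are commonly defined as below. This is a sketch using the usual textbook forms; the `gamma` and `coef0` arguments are assumptions for the example and are not parameters exposed in the wizard.

```python
import math


def dot(x, y):
    """Plain dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """K(x, y) = (gamma * <x, y> + coef0) ** degree"""
    return (gamma * dot(x, y) + coef0) ** degree


def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    """K(x, y) = tanh(gamma * <x, y> + coef0)"""
    return math.tanh(gamma * dot(x, y) + coef0)


x, y = [1.0, 2.0], [0.5, 1.0]
print(polynomial_kernel(x, y, degree=3))  # (2.5 + 1) ** 3 = 42.875
print(sigmoid_kernel(x, y))               # a value strictly between -1 and 1
```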
SGD
Parameter Description
SGD alpha regularization parameter Regularization parameter on training data. It is used to prevent overfitting and underfitting. Larger values set a stronger regularization.
Possible values:
- 0.0001
- 0.001
- 0.01
- 0.1 (by default)
- 0.2
- 0.3
- 0.4
- 0.5
- 0.6
- 0.7
- 0.8
- 0.9
- 1
Class weight Regularization parameter on the class balancing. For example, in a training set where one class is preponderant over all the others, selecting the value Balanced forces a re-weighting to prevent unbalanced predictions for less represented classes. If the training set is highly representative, choosing Balanced makes the model perform slightly worse than it would with the None value. If, on the other hand, the training set is not very representative and None is selected, the result is a very poor model.
Possible values:
- Balanced (by default)
- None
GBoost
Parameter Description
Number of trees Number of decision trees to generate to make a decision. This value affects regularization. A high value implies low regularization (or generalization) and possible overfitting. On the other hand, a low value implies a high level of generalization, with the risk of underfitting and of not predicting some classes.
Possible values: number between 20 and 500. (100 by default)
Learning rate The rate at which the model adapts during learning, that is, how quickly the error tends to decrease. This parameter lets you decide whether to favor speed or accuracy. Large values mean fast learning, but this can produce an ineffective model because learning is too fast to take all the relevant features into account. Slower learning takes more time but yields a more effective model. In other words, this parameter shrinks the contribution of each decision tree.
Possible values: number between 0.000001 and 1. (0.1 by default)
Tolerance for early stopping Minimum amount by which the error must keep decreasing; when the improvement falls below this value, the algorithm stops iterating earlier than expected (early stopping). Large values mean fewer iterations.
Possible values: number between 0 and 1. (0.0001 by default)
N. of iteration with no change Parameter used as early stopping criterion. During training, if the score has not improved over this number of consecutive iterations, the training stops.
Possible values: number between -1 and 10. (-1 = no changes, set by default)
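
The interplay between the tolerance and the number of iterations with no change can be sketched as follows. This assumes scikit-learn-like semantics (training stops when the score has not improved by at least the tolerance for the given number of consecutive iterations); the function and the list of validation scores are hypothetical.

```python
def stopping_iteration(scores, tol=0.0001, n_iter_no_change=3):
    """Return the index of the iteration at which training would stop."""
    best = float("-inf")
    stale = 0  # consecutive iterations without a sufficient improvement
    for i, score in enumerate(scores):
        if score > best + tol:
            best = score
            stale = 0
        else:
            stale += 1
            if stale >= n_iter_no_change:
                return i
    return len(scores) - 1  # never triggered: all iterations are run


# Hypothetical validation scores, one per boosting iteration.
scores = [0.60, 0.70, 0.75, 0.7500, 0.7501, 0.7500, 0.80]
print(stopping_iteration(scores, tol=0.001, n_iter_no_change=3))    # 5
print(stopping_iteration(scores, tol=0.00001, n_iter_no_change=3))  # 6
```

With the larger tolerance the tiny gain at iteration 4 does not count as an improvement, so training stops earlier.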
XGBoost
Parameter Description
Number of trees Number of decision trees to generate to make a decision. This value affects regularization. A high value implies low regularization (or generalization) and possible overfitting. On the other hand, a low value implies a high level of generalization, with the risk of not predicting some classes.
Possible values: number between 20 and 500. (100 by default)
Learning rate The rate at which the model adapts during learning, that is, how quickly the error tends to decrease. This parameter lets you decide whether to favor speed or accuracy. High values mean fast learning, but this can produce an ineffective model because learning is too fast to take all the relevant features into account. Slower learning takes more time but yields a more effective model. In other words, this parameter shrinks the contribution of each decision tree.
Possible values: number between 0.000001 and 1. (0.1 by default)
L2 regularization term on weights Regularization lambda parameter. Ridge regularization parameter, used to reduce complexity and prevent overfitting. The cost function is altered by adding a penalty term equal to the squared magnitude of the coefficients. The penalty term lambda regularizes large coefficients by penalizing the cost function. When the parameter L2 regularization term on weights is equal to 0, no regularization is applied.
Possible values: number between 0 and 1. (0 by default)
L1 regularization term on weights Regularization alpha parameter. Lasso regularization parameter, used to reduce complexity and prevent overfitting. The cost function is altered by adding a penalty term equal to the absolute magnitude of the coefficients. When the parameter L1 regularization term on weights is equal to 0, no regularization is applied.
Possible values: number between 0 and 1. (0 by default)
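
The effect of the two terms on the cost function can be sketched as below. This is a generic illustration of Lasso and Ridge penalties, not the exact XGBoost objective (whose L2 term, for instance, is conventionally scaled by one half); the function name and the weights are hypothetical.

```python
def regularized_cost(base_cost, weights, l1_alpha=0.0, l2_lambda=0.0):
    """Add L1 (absolute value) and L2 (squared) penalties to a base cost."""
    l1_penalty = l1_alpha * sum(abs(w) for w in weights)
    l2_penalty = l2_lambda * sum(w * w for w in weights)
    return base_cost + l1_penalty + l2_penalty


weights = [0.5, -2.0, 1.0]
print(regularized_cost(1.0, weights))                 # 1.0, no regularization
print(regularized_cost(1.0, weights, l1_alpha=1.0))   # 1.0 + 3.5 = 4.5
print(regularized_cost(1.0, weights, l2_lambda=1.0))  # 1.0 + 5.25 = 6.25
```

The squared penalty is dominated by the largest coefficient (-2.0), which is why the L2 term mainly discourages large weights.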
Random Forest
Parameter Description
Number of decision trees Number of decision trees to generate to make a decision. This value affects regularization. A high value implies low regularization (or generalization) and possible overfitting. On the other hand, a low value implies a high level of generalization (underfitting), with the risk of not predicting some classes.
Possible values: number between 20 and 500. (100 by default)
Split criterion on tree nodes The split criterion to be applied on each node of the tree, that is, the criterion used to choose the branch to follow in the decision tree.
Possible values:
- Gini impurity (by default)
- Entropy
Class weight Regularization parameter on the class balancing. For example, in a training set where one class is preponderant over all the others, selecting the value Balanced forces a re-weighting to prevent unbalanced predictions for less represented classes. If the training set is highly representative, choosing Balanced makes the model perform slightly worse than it would with the None value. If, on the other hand, the training set is not very representative and None is selected, the result is a very poor model.
Possible values:
- Balanced (by default)
- None
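
The two split criteria measure how mixed the classes are in a tree node. A minimal sketch using the standard definitions (the function names and the sample node are illustrative):

```python
import math
from collections import Counter


def gini_impurity(labels):
    """1 - sum(p_k ** 2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """-sum(p_k * log2(p_k)) over the class proportions p_k."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


node = ["a", "a", "b", "b"]  # perfectly mixed two-class node
print(gini_impurity(node))  # 0.5
print(entropy(node))        # 1.0
```

Both measures are 0 for a pure node (a single class) and grow as the classes become more mixed; a split is chosen to reduce them as much as possible.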
Logistic Regression
Parameter Description
Optimization problem algorithm The algorithm to use in the optimization problem.
Possible values:
- newton-cg
- lbfgs (by default)
- sag
- saga
Inverse of regularization strength Penalty for misclassification. Regularization parameter on training data. It is used to prevent overfitting and underfitting. If you have a big dataset and you consider it representative, the regularization parameter value should be small to increase the regularization and force the model to be tailored on training data. On the contrary, if you have a small dataset, which may not be representative, it is better to select a large value to avoid errors.
Possible values:
- 0.001
- 0.01
- 0.1 (by default)
- 0.2
- 0.3
- 0.4
- 0.5
- 0.6
- 0.7
- 0.8
- 0.9
- 1
- 5
Stop condition tolerance Minimum amount by which the error must keep decreasing; when the improvement falls below this value, the algorithm stops iterating earlier than expected (early stopping). High values mean fewer iterations.
Possible values: number between 0.000001 and 0.1. (0.0001 by default)
Class weight Regularization parameter on the class balancing. For example, in a training set where one class is preponderant over all the others, selecting the value Balanced forces a re-weighting to prevent unbalanced predictions for less represented classes. If the training set is highly representative, choosing Balanced makes the model perform slightly worse than it would with the None value. If, on the other hand, the training set is not very representative and None is selected, the result is a very poor model.
Possible values:
- Balanced (by default)
- None
Multinomial Naive Bayes
Parameter Description
Alpha regularizer Regularization parameter, smoothing factor on term counts. Large values increase the regularization.
Possible values:
- 0
- 0.1
- 0.2
- 0.3
- 0.4
- 0.5
- 0.6
- 0.7
- 0.8
- 0.9
- 1 (by default)
Complement Naive Bayes
Parameter Description
Alpha regularizer Regularization parameter, smoothing factor on term counts. Large values increase the regularization.
Possible values:
- 0
- 0.1
- 0.2
- 0.3
- 0.4
- 0.5
- 0.6
- 0.7
- 0.8
- 0.9
- 1 (by default)
Normalize: penalize long documents to avoid their dominance in stats Long documents are discarded to balance the statistics.
Possible values:
- on
- off (by default)
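
The role of the Alpha regularizer as a smoothing factor on term counts can be sketched with standard additive (Laplace/Lidstone) smoothing. That the Platform applies exactly this formula is an assumption; the function name and the counts are hypothetical.

```python
def smoothed_term_probability(term_count, total_count, vocabulary_size, alpha=1.0):
    """P(term | class) = (count + alpha) / (total + alpha * |V|)"""
    return (term_count + alpha) / (total_count + alpha * vocabulary_size)


# A term never seen in the class still gets a non-zero probability when alpha > 0.
print(smoothed_term_probability(0, 100, 50, alpha=1.0))  # 1 / 150
print(smoothed_term_probability(0, 100, 50, alpha=0.0))  # 0.0
```

Larger alpha values pull all term probabilities toward a uniform distribution, which is the sense in which they increase regularization.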

F-Beta

To know more about F-Beta read this article.

Parameter Description
Enable F-Beta optimization (tuning balance between precision and recall) Disabled by default. If enabled it is possible to set the Target F-Beta parameter.
Target F-Beta A default beta value is 1.0, which is the same as the F-measure. A smaller beta value, such as 0.5, gives more weight to precision and less to recall. A larger beta value, such as 2.0, gives less weight to precision and more weight to recall in the calculation of the score.
Possible values: number between 0 and 2. (1 by default)
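
The way beta shifts the balance between precision and recall can be checked with the standard F-Beta formula. The sketch below is illustrative; the precision and recall values are made up.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta = (1 + beta^2) * P * R / (beta^2 * P + R)"""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Hypothetical model with high precision and lower recall.
p, r = 0.9, 0.5
for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_beta(p, r, beta), 3))
```

Because precision is higher than recall in this example, the score decreases as beta grows: a larger beta rewards recall, which is the weaker of the two here.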

Auto ML parameters

Machine Learning automatic self-tuning process parameters

AutoML is an algorithm that runs separately from the model being trained. AutoML selects the best combination of Features and Model Parameters to return the best performing model. The parameters on this screen pertain only to the AutoML algorithm.

Parameter Description
Number of training iterations for the AutoML algorithm
Possible values: number between 20 and 100 (30 by default)
Number of data splits for cross-validation of AutoML algorithm
Possible values: number between 2 and 10 (3 by default)
Call back function for stopping the AutoML self-tuning process Possible values:
- Stop based on best scoring evaluation
- Stop based on total time
- Stop based on both best scoring evaluation and total time (by default)
Target time deadline for the AutoML call back stop function (minutes)
Possible values: number between 15 and 120 (30 by default). Disabled if Stop based on best scoring evaluation in Call back function for stopping the AutoML self-tuning process is selected.
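
To make the number of data splits concrete, the sketch below shows plain k-fold cross-validation over document indices. It assumes unshuffled, unstratified folds; the splitting strategy actually used by the AutoML algorithm is not described here, and the function name is hypothetical.

```python
def k_fold_splits(n_documents, n_splits=3):
    """Yield (train_indices, validation_indices) for each fold."""
    indices = list(range(n_documents))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_documents // n_splits + (1 if i < n_documents % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


for train, validation in k_fold_splits(7, n_splits=3):
    print(train, validation)
```

Each document is used for validation exactly once, so more splits mean more training runs per candidate configuration.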

Explainable Categorization

Generic parameters

Parameter Description
Enable "onCategorizer" optimization Enables or disables the scripting included in the onCategorizer event handler function. Enabled by default.
Enable "strict" hierarchical mode If enabled, this parameter ensures that predictions are consistent with the strict hierarchy of the project taxonomy: if the predicted category is a child, its whole hierarchical chain (parents and ancestors) is also predicted. For example, if the analysis detects that the document topic is cat, the prediction of its strict hierarchical chain is also forced; in a project taxonomy this could be feline, mammals, vertebrates, animals and not, for example, only feline and animals. Disabled by default.
Enable "single label" mode If this parameter is switched on, the model considers only the prediction with the highest score and sets it as the only winner, that is, the model predicts "one and only one" category. If switched off, the model predicts from 0 to N categories depending on rules and scripting, for example on the Filter( ) and Clean( ) functions used to manage the category scores in the onCategorizer event handler function. Disabled by default.

Note

If Enable "onCategorizer" optimization is disabled, the Fine tuning parameters will not be available.
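
The "strict" hierarchical mode can be pictured as expanding each predicted category with its full chain of ancestors. The sketch below reuses the cat example; the `PARENT` mapping and the function name are hypothetical and do not reflect how the Platform stores a taxonomy.

```python
# Hypothetical taxonomy: each category maps to its parent (None for the root).
PARENT = {
    "cat": "feline",
    "feline": "mammals",
    "mammals": "vertebrates",
    "vertebrates": "animals",
    "animals": None,
}


def with_ancestors(category, parent=PARENT):
    """Return the category followed by its full chain of ancestors."""
    chain = []
    while category is not None:
        chain.append(category)
        category = parent[category]
    return chain


print(with_ancestors("cat"))
# ['cat', 'feline', 'mammals', 'vertebrates', 'animals']
```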

Rules Generation

The rules generated by the Explainable categorization engine use mainly the following attributes:

  • LEMMA
  • SYNCON
  • ANCESTOR
  • KEYWORD
  • PATTERN

If the analyzed text matches Knowledge Graph entries, the engine tends by default to generate combinations of LEMMA and/or SYNCON and/or ANCESTOR attribute types. If the analyzed text doesn't match Knowledge Graph entries, KEYWORD and PATTERN attributes are also used.

Parameter Description
Enable generation of syncon based rules Enables the generation of rules based on the SYNCON attribute. By default the engine tends to generate light rules starting from LEMMA and/or SYNCON attribute types, which is why this parameter is enabled by default. If disabled, the parameter Enable generation of ancestor based rules is also disabled.
Enable generation of ancestor based rules Generating rules with the ANCESTOR attribute is heavier and affects the training speed, so this parameter is disabled by default. It is also disabled if the Enable generation of syncon based rules parameter is disabled.
Max number of items in each rule The maximum number of groups in a rule separated by an AND operator.
Possible values: number between 2 and 4. (3 by default)
Max number of rules for each taxonomy category Maximum number of generated rules to consider that refer to a specific category. Parameter used to reduce the verbosity of the language project. A large number of generated rules implies a better performing model but a project that is difficult to maintain.
Possible values: number between 5 and 1000. (200 by default)
Min number of annotated documents for a category, to enable rules generation The minimum number of annotated documents in the training set that enables the rules generation.
Possible values: number between 2 and 1000. (5 by default)
Max number of rules in which any single item can participate This parameter determines the number of rules in which a single textual element (which could be a syncon, a lemma or a keyword) can be used. In other words, it limits the over-generation of rules from a single textual element.
Possible values: number between 2 and 200. (40 by default)

Fine tuning

Categorization "onCategorizer" optimization hyperparameters

Parameters displayed only if Enable "onCategorizer" optimization parameter in Generic parameters is switched-on.

Parameter Description
Desired Clean level The value to set in the CLEAN function (auto by default. auto = automatic)
Default Clean level (initial value if auto is selected) The parameter set by default if Desired Clean level is set to auto. (10 by default). If Enable "single label" mode in Generic parameters is switched on, the parameter is set to 0, which means all the categories are considered.
Desired Filter sequence The values to set in the FILTER function (auto by default. auto = automatic)
Default Filter sequence (initial value if auto is selected) The parameter set by default if Desired Filter sequence is set to auto. (40, 80, 90, 90, 90 sequence set by default). The algorithm starts with the default value sequence and then, during the iterations, the sequence can be improved by the engine itself. If Enable "single label" mode in Generic parameters is switched on, the value is set to 100, which means just one class is considered.
Enable conservative Clean If no class exceeds the value set in Default Clean level (initial value if auto is selected), the class or classes with the highest score are considered the winners. Disabled by default
Max number of documents to be considered by the optimization algorithm (value of -1 is no limit) Parameter useful to limit the documents to be considered by the optimization algorithm in order to improve the speed. Number greater than -1 (set by default)

F-Beta

Precision and recall balance

Parameter Description
Target F-Beta A beta value of 1.0 gives the standard F-measure (F1). A smaller beta value, such as 0.5, gives more weight to precision and less to recall. A larger beta value, such as 2.0, gives less weight to precision and more weight to recall in the calculation of the score.
Possible values: number between 0 and 2. (0.75 by default)

Bootstrapped Studio Project

Rules Generation

Parameter Description
Enable generation of syncon based rules Enables the generation of rules based on the SYNCON attribute. By default the engine tends to generate light rules starting from LEMMA and/or SYNCON attribute types, which is why this parameter is enabled by default. If disabled, the parameter Enable generation of ancestor based rules is also disabled.
Enable generation of ancestor based rules Generating rules with the ANCESTOR attribute is heavier and affects the training speed, so this parameter is disabled by default. It is also disabled if the Enable generation of syncon based rules parameter is disabled.
Max number of items in each rule The maximum number of groups in a rule separated by an AND operator.
Possible values: number between 1 and 3. (2 by default)
Max number of rules for each taxonomy category Maximum number of generated rules to consider that refer to a specific category. Parameter used to reduce the verbosity of the language project. A large number of generated rules implies a better performing model but a project that is difficult to maintain.
Possible values: number between 5 and 50. (20 by default)
Min number of annotated documents for a category, to enable rules generation The minimum number of annotated documents in the training set that enables the rules generation.
Possible values: number between 2 and 100. (5 by default)
Max number of rules in which any single item can participate Determines the number of rules in which a single textual element (which could be a syncon, a lemma or a keyword) can be used. In other words, it limits the over-generation of rules from a single textual element.
Possible values: number between 1 and 3. (2 by default)

Fine tuning

Parameter Description
Desired Clean level The value to set in the CLEAN function (10 by default)
Desired Filter sequence The values to set in the FILTER function. (Sequence 40, 80, 90 by default)

Extraction experiments

Auto-ML Extraction

Feature space

If Automatic features selection is switched off, the status of the following parameters can be set:

Parameter Description
Word base form (Lemma) Base form of a word (lemma) (for example run for "running" or "ran")
Logic dependencies Relationships and dependencies of the word (for example "subject" – "relationship type" – "object" relationships)
Word Part-of-Speech Part-of-speech of a word (for example noun, verb, etc.)
Collocations Combination of words frequently used together which have a specific meaning (e.g. "regular exercise" or "to take a risk")
Phrases Combination of words that together create a singular meaning (e.g. "to look after" or "on the table")
Syncons Conceptual meaning of a word or phrase (for example "to work out" means "to exercise")
Known Concepts Specific meaning of a term that is available in the Expert.ai Knowledge Graph (for example "Italy" is a specific country, "World Cup" is a specific football tournament)
Entities Entities (for example persons, organizations, etc.)
Knowledge Graph relations Attribute the hierarchical relation nodes as added meaning to the word (e.g. "dentist" is also "medical specialist", "doctor", "professional", etc.). Set on Don't use by default
Title case words Toggle title case as a feature in the word vector
Upper case word Toggle uppercase as a feature in the word vectors
Digit words Toggle digit as a feature in the word vector
Mixed case words Toggle mixed case as a feature in the word vector
Alpha Numeric words Toggle alpha numeric as a feature in the word vector. Set on Use by default
Alphabetic words Toggle alphabetic as a feature in the word vector
Numeric words Toggle numeric as a feature in the word vector
Decimal number words Toggle decimal number as a feature in the word vector

Hyperparameters

The following parameters are specific to the CRF model.

Parameter Description
CRF c1 regularization coefficient Coefficient for L1 (alpha) regularization. If set to a value greater than 0, it enables the Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) method.
Possible values:
- 0
- 0.00001
- 0.0001
- 0.001 (by default)
- 0.05
- 0.1
- 0.3
- 0.5
- 0.8
- 1
- 2
- 5
- 10
- 100
CRF c2 regularization coefficient Regularization coefficient. If a value is set, it enables L2 (lambda) regularization in the l2sgd and lbfgs training algorithms.
Possible values:
- 0.00001
- 0.000
- 0.001 (by default)
- 0.1
- 0.3
- 0.
- 2
- 5
- 10
- 100

F-Beta

Parameter Description
Enable F Beta optimization (tuning balance between precision and recall) Disabled by default. If enabled it is possible to set the Target F Beta parameter.
Target F Beta A default beta value is 1.0, which is the same as the F-measure. A smaller beta value, such as 0.5, gives more weight to precision and less to recall. A larger beta value, such as 2.0, gives less weight to precision and more weight to recall in the calculation of the score.
Possible values: number between 0 and 5. (1 by default)

Auto-ML parameters

Machine Learning automatic self-tuning process parameters

AutoML is an algorithm that runs separately from the model being trained. AutoML selects the best combination of Features and Model Parameters to return the best performing model. The parameters on this screen pertain only to the AutoML algorithm.

Parameter Description
Number of training iterations for the AutoML algorithm
Possible values: number between 20 and 100. (30 by default)
Number of data splits for cross-validation of AutoML algorithm
Possible values: number between 2 and 10. (3 by default)
Call back function for stopping the AutoML self-tuning process Possible values:
- Stop based on best scoring evaluation
- Stop based on total time
- Stop based on both best scoring evaluation and total time (by default)
Target time deadline for the AutoML call back stop function (minutes)
Possible values: number between 15 and 120. (30 by default)

Note

If you select Stop based on best scoring evaluation, the Target time deadline for the AutoML call back stop function (minutes) parameter won't be available.

Explainable Extraction

Rules generation

Support, confidences and tolerance parameters

Parameter Description
Maximum number of conditions for any given rule Parameter that determines how many conditions to use in a rule. For example, the following part of a rule:
@field[ATTRIBUTE_01] operator ATTRIBUTE_02
is composed of two conditions, defined by ATTRIBUTE_01 and ATTRIBUTE_02, and of the operator.
Possible values: a number between 1 and 5. (3 by default)
Enable automatic minimum support setup Enables the automatic minimum support. If switched on, the engine automatically sets the value. Minimum support is the number of times a rule must match in the training set to be generated. Switched-on by default
Custom minimum support threshold Minimum support value to enter manually, only if the Enable automatic minimum support setup parameter is switched off. Number greater than 2. (5 by default)
Enable automatic minimum support (NOTE: It must be read as confidence) setup Enables the automatic minimum confidence. If switched on, the engine automatically sets the value. Confidence is the number of times a rule matches in the class target context to be generated. Switched-on by default. Note: ideally a "good" rule triggers many times (a high support level) and always and only in the class target context (a high confidence level).
Custom minimum confidence to explore a rule Minimum confidence value to enter manually only if the previous parameter is switched off.
Possible values: number between 0.001 and 0.2. (0.05 by default)
Minimum acceptance confidence threshold Minimum threshold to determine that a rule is acceptable. Small values mean greater acceptance and imply a greater final project recall.
Possible values: number between 0.2 and 0.95. (0.6 by default)
Minimum confidence improvement for adding a new condition to a rule The confidence improvement (delta) obtained by adding a condition to the rule that leads to the acceptance of the rule itself.
Possible values: number between 0.01 and 0.2. (0.01 by default)
Enable concatenation of contiguous extractions Enable the concatenations of contiguous extractions if the extractions are composed of multiple adjacent tokens. Switched-off by default
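
Support and confidence can be illustrated with a small sketch. It assumes that confidence is the fraction of a rule's matches that fall in the target class context, an interpretation suggested by the 0 to 1 ranges above but not confirmed here; the function names, the labels and the thresholds used in the example are hypothetical.

```python
def support_and_confidence(matches, target_class):
    """matches: list of class labels, one per place where the rule fired."""
    support = len(matches)
    if support == 0:
        return 0, 0.0
    hits = sum(1 for label in matches if label == target_class)
    return support, hits / support


def is_accepted(matches, target_class, min_support=5, min_confidence=0.6):
    """Accept a rule only if it fires often enough and mostly on the target class."""
    support, confidence = support_and_confidence(matches, target_class)
    return support >= min_support and confidence >= min_confidence


# Hypothetical rule that fired 6 times, 5 of them on tokens annotated as DATE.
matches = ["DATE", "DATE", "DATE", "OTHER", "DATE", "DATE"]
print(support_and_confidence(matches, "DATE"))  # support 6, confidence 5/6
print(is_accepted(matches, "DATE"))             # True
```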

Feature options

Parameter Description
Window size (in tokens) to the left of the token being predicted Window is a segment (set of tokens) used to find regularity in the text in order to generate rules. This parameter specifies the number of tokens to consider to the left of (preceding) the predicted token.
Possible values: number between 0 and 5. (3 by default)
Window size (in tokens) to the right of the token being predicted This parameter specifies the number of tokens to consider to the right of (following) the predicted token.
Possible values: number between 0 and 5. (3 by default)
Minimum document frequency Number of documents in which the feature (for example a keyword) must be present in order to generate rules. Number greater than 1 (2 by default)
Raw word form The word itself. Switched-on by default
Word base form (Lemma) Base form of a word (lemma) (for example "run" for "running" or "ran"). Switched-on by default
Word Part-of-Speech Part-of-speech of a word (for example noun, verb, etc.). Switched-on by default
Syncons Conceptual meaning of a word or phrase (for example "to work out" means "to exercise"). Switched-on by default
Ancestors More abstract concepts related to syncons. Switched-on by default
Numeric words Enable the usage of numeric tokens (for example Aspirina 500 mg, 500 is the numeric token) as a feature in order to generate rules. Switched-off by default
Use suffix of a word Enable the usage of token suffixes (for example -ina 500 mg) as a feature in order to generate rules. Switched-off by default
Use prefix of a word Enable the usage of token prefixes (for example Aspi-) as a feature in order to generate rules. Switched-off by default
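
The two window size parameters can be pictured as slicing the tokens around the one being predicted. A minimal sketch, with a hypothetical function name and a simple whitespace-style tokenization:

```python
def context_window(tokens, index, left=3, right=3):
    """Return the tokens to the left and to the right of tokens[index]."""
    before = tokens[max(0, index - left):index]
    after = tokens[index + 1:index + 1 + right]
    return before, after


tokens = ["Take", "one", "Aspirina", "500", "mg", "tablet", "daily"]
# Context of the numeric token "500" with 2 tokens on the left and 1 on the right.
print(context_window(tokens, 3, left=2, right=1))
# (['one', 'Aspirina'], ['mg'])
```

Setting a window size to 0 simply drops that side of the context.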

Rules selection

Options for selecting best rules

The rules generated are selected and optimized according to the following parameters.

Parameter Description
Fine-tuning rules selecting only the most significant ones If switched-off all the other parameters in the Options for selecting best rules wizard page are disabled. Switched-on by default
Number of rules selection steps Steps to perform in order to select the best generated rules on the validation data subset.
Possible values: number between 20 and 100. (50 by default)
Fraction of validation split Fraction of documents that are not used to generate rules but are used as the validation data subset.
Possible values: number between 0.1 and 0.9. (0.2 by default)
Activate rules pruning If switched-on enables the Max number of rules to select parameter. Switched-off by default
Max number of rules to select Parameter that specifies the number of best rules (according to a performance evaluation algorithm) on a single class to be saved. Beyond this number the rules are pruned (removed). Enabled only if the previous parameter is switched-on.
Possible values: number between 1 and 1000. (100 by default)
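
The validation split and the pruning step can be sketched as follows. This is only an illustration under stated assumptions: the random hold-out, the rule scores and both function names are hypothetical, and the performance evaluation algorithm actually used to rank rules is not described here.

```python
import random


def split_documents(documents, validation_fraction=0.2, seed=0):
    """Hold out a fraction of the documents for rule validation."""
    shuffled = documents[:]
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


def prune_rules(scored_rules, max_rules=100):
    """Keep only the best-scoring rules of a class (scores are hypothetical)."""
    ranked = sorted(scored_rules, key=lambda item: item[1], reverse=True)
    return ranked[:max_rules]


train, validation = split_documents(list(range(10)), validation_fraction=0.2)
print(len(train), len(validation))  # 8 2
print(prune_rules([("r1", 0.4), ("r2", 0.9), ("r3", 0.7)], max_rules=2))
# [('r2', 0.9), ('r3', 0.7)]
```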