Experiment parameters
The parameters that can be set in the categorization experiment and the extraction experiment wizards are described below.
Categorization experiments
Auto-ML Categorization
Problem definition
Parameter | Description |
---|---|
Enable strict "single label" mode | When turned on, the model always predicts one (and only one) category for each document. When off, the model can detect any number of categories—multi label predictions—or no category. Default: off. |
Enable strict "Sub document categorization" compatibility mode | When turned on, the model can predict categories over portions of the document (sub document categorization). Default: off. Annotation and testing for sub document categorization must be performed with external tools; contact expert.ai technical support for more information. |
Warning
If the Enable strict "single label" mode parameter is turned on, F-Beta parameters will not be available.
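As a rough illustration of the difference between the two modes, the sketch below turns per-category scores into predictions. The scores, the threshold value and the function names are assumptions made for this example; they are not part of the Platform.

```python
def predict_single_label(scores):
    """Strict single-label mode: always return exactly one category."""
    return max(scores, key=scores.get)


def predict_multi_label(scores, threshold=0.5):
    """Multi-label mode: return every category at or above the threshold,
    which may be several categories or none at all."""
    return [cat for cat, score in scores.items() if score >= threshold]


# Hypothetical scores for one document.
scores = {"sport": 0.72, "finance": 0.55, "politics": 0.10}
print(predict_single_label(scores))   # sport
print(predict_multi_label(scores))    # ['sport', 'finance']
print(predict_multi_label({"sport": 0.2, "finance": 0.1}))  # []
```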
Feature space
The main parameter for this step of the experiment wizard is Automatic features selection, which is turned on by default. This means that Platform will automatically determine the document features to use to train the model.
If turned off, the following parameters can be set to either Use or Don't use:
Parameter | Description of the corresponding feature | Default |
---|---|---|
Word form | Word exactly as written in the text | Use |
Word base form (Lemma) | Base form of a word i.e. its lemma, for example run for words like running or ran | Use |
Main lemma | Document-level most representative lemmas | Don't use |
Word base form stem | Stem of a word, for example intern for international | Use |
Sub-words | Parts of a word like morphemes, stems and endings, roots | Use |
Entities | Named entities derived from the text, like people, places and organizations | Use |
Syncons | Meaning of words determined by disambiguation, for example to work out → to exercise | Use |
Main Syncons | Document-level most representative syncons | Use |
Syncon Topics | Generalized main subjects being discussed (e.g. "mammal" as a concept in "the tiger is a mammal" has topic "zoology") | Use |
Main Topics | Document-level most representative topics | Use |
Knowledge Label | Pre-defined parent syncon (e.g. "legal action" is the knowledge label for "moratorium") | Don't use |
Knowledge Graph relations | Attribute the hierarchical relation nodes as added meaning to the word (e.g. "dentist" is also "medical specialist", "doctor", "professional", etc.) | Use |
Use word embeddings | Static word embeddings | Don't use |
Hyperparameters
Hyperparameters are model-specific. They are listed below for each model.
Linear SVM and Probabilistic SVM
Parameter | Description |
---|---|
SVM C parameter: penalty for misclassifications | The C parameter is a regularization (or generalization) parameter on training data. It is used to prevent overfitting and underfitting. If you have a big dataset and you consider it representative, the regularization parameter value should be small to increase the regularization and force the model to be tailored on training data. On the contrary if you have a small dataset, which may not be representative, it is better to select a large value to avoid the possible errors related to a strong regularization. Possible values: - 0.001 - 0.01 - 0.05 - 0.1 - 0.2 - 0.3 - 0.4 - 0.5 - 0.6 - 0.7 - 0.8 - 0.9 - 1 (by default) - 5 |
Class weight | Regularization parameter on class balancing. For example, in a training set where one class is preponderant over all the others, selecting the value Balanced forces a re-weighting to prevent unbalanced predictions for less represented classes. If the training set is highly representative, choosing Balanced makes the model perform slightly worse than it would with the None value. If, on the other hand, the training set is not very representative and None is selected, the result is a very poor model. Possible values: - Balanced (by default) - None |
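The re-weighting behind the Balanced value can be sketched as follows. This is only an illustration: it assumes the common inverse-frequency heuristic (each class weighted by n_samples / (n_classes * class_count)), which the source does not confirm as the formula used by the Platform. The function name and the sample labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency:
    n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical training set: "news" is preponderant over "sport".
labels = ["news"] * 8 + ["sport"] * 2
weights = balanced_class_weights(labels)
print(weights)  # {'news': 0.625, 'sport': 2.5}
```

The less represented class receives the larger weight, so its training errors count more.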
Custom Kernel SVM
Parameter | Description |
---|---|
Custom kernel type to be applied | The kernel function to select to represent the feature. Possible values: - Polynomial kernel (by default) - Sigmoid kernel |
Inverse of regularization strength | Penalty for misclassification. Regularization parameter on training data. It is used to prevent overfitting and underfitting. If you have a big dataset and you consider it representative, the regularization parameter value should be small to increase the regularization and force the model to be tailored on training data. On the contrary, if you have a small dataset, which may not be representative, it is better to select a large value to avoid errors. Possible values: - 0.001 - 0.01 - 0.05 - 0.1 - 0.2 - 0.3 - 0.4 - 0.5 - 0.6 - 0.7 - 0.8 - 0.9 - 1 (by default) - 5 |
Degree of polynomial for polynomial kernel | Available if the parameter Custom kernel type to be applied is set to Polynomial kernel. The parameter sets the degree of the polynomial kernel function. Possible values: number between 2 and 7. (3 by default) |
Stop condition tolerance | Value indicating how much the error must decrease before early stopping, that is, the end of the algorithm iterations earlier than planned. Large values mean fewer iterations. Possible values: number between 0.000001 and 0.1. (0.0001 by default) |
Class weight | Regularization parameter on class balancing. For example, in a training set where one class is preponderant over all the others, selecting the value Balanced forces a re-weighting to prevent unbalanced predictions for less represented classes. If the training set is highly representative, choosing Balanced makes the model perform slightly worse than it would with the None value. If, on the other hand, the training set is not very representative and None is selected, the result is a very poor model. Possible values: - Balanced (by default) - None |
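For reference, the two kernel types can be written as plain functions. This is a minimal sketch using the textbook definitions; the gamma and coef0 arguments are assumptions of this example and are not parameters exposed by the wizard.

```python
import math


def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """(gamma * <x, y> + coef0) ** degree"""
    return (gamma * dot(x, y) + coef0) ** degree


def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    """tanh(gamma * <x, y> + coef0)"""
    return math.tanh(gamma * dot(x, y) + coef0)


x, y = [1.0, 2.0], [0.5, 1.0]
print(polynomial_kernel(x, y, degree=3))  # (2.5 + 1) ** 3 = 42.875
print(sigmoid_kernel(x, y))               # tanh(2.5)
```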
SGD
Parameter | Description |
---|---|
SGD alpha regularization parameter | Regularization parameter on training data. It is used to prevent overfitting and underfitting. Larger values set a stronger regularization. Possible values: - 0.0001 - 0.001 - 0.01 - 0.1 (by default) - 0.2 - 0.3 - 0.4 - 0.5 - 0.6 - 0.7 - 0.8 - 0.9 - 1 |
Class weight | Regularization parameter on class balancing. For example, in a training set where one class is preponderant over all the others, selecting the value Balanced forces a re-weighting to prevent unbalanced predictions for less represented classes. If the training set is highly representative, choosing Balanced makes the model perform slightly worse than it would with the None value. If, on the other hand, the training set is not very representative and None is selected, the result is a very poor model. Possible values: - Balanced (by default) - None |
GBoost
Parameter | Description |
---|---|
Number of trees | Number of decision trees to generate to make a decision. This value affects regularization. A high value implies low regularization (or generalization) and possible overfitting. On the other hand, a low value implies a high level of generalization, with the risk of underfitting and of not predicting some classes. Possible values: number between 20 and 500. (100 by default) |
Learning rate | The rate at which the model adapts during learning, that is, how quickly the error tends to decrease. This parameter lets you decide whether to favor speed or accuracy. Large values mean fast learning, but this could produce an ineffective model because the learning is too fast to consider all the features involved. On the other hand, a slower learning path takes more time but gives a more effective model. In other words, this parameter shrinks the contribution of each decision tree. Possible values: number between 0.000001 and 1. (0.1 by default) |
Tolerance for early stopping | Value indicating how much the error must decrease before early stopping, that is, the end of the algorithm iterations earlier than expected. Large values mean fewer iterations. Possible values: number between 0 and 1. (0.0001 by default) |
N. of iteration with no change | Parameter used as early stopping criterion. During training, if the score has not improved over the last N iterations, where N is the value of this parameter, the training stops. Possible values: number between -1 and 10. (-1 = no changes, set by default) |
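The interaction between the tolerance and the number of iterations with no change can be sketched as below. This is an assumption about how such a criterion typically works (training stops once the loss has failed to improve by at least the tolerance for N consecutive iterations), not a description of the Platform internals. The function name and the loss values are made up for the example.

```python
def stopping_iteration(losses, tol=0.0001, n_iter_no_change=3):
    """Return the 1-based iteration at which training would stop, or None
    if the early-stopping criterion is never met."""
    best = float("inf")
    stale = 0  # consecutive iterations without a sufficient improvement
    for i, loss in enumerate(losses, start=1):
        if loss < best - tol:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= n_iter_no_change:
                return i
    return None


# Hypothetical loss curve that flattens out after the third iteration.
losses = [1.0, 0.6, 0.4, 0.39995, 0.39993, 0.39992, 0.1]
print(stopping_iteration(losses, tol=0.0001, n_iter_no_change=3))    # 6
print(stopping_iteration(losses, tol=0.000001, n_iter_no_change=3))  # None
```

With the larger tolerance the tiny improvements do not count and training stops early; with the smaller tolerance every iteration counts as an improvement and all iterations run.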
XGBoost
Parameter | Description |
---|---|
Number of trees | Number of decision trees to generate to make a decision. This value affects regularization. A high value implies low regularization (or generalization) and possible overfitting. On the other hand, a low value implies a high level of generalization, with the risk of not predicting some classes. Possible values: number between 20 and 500. (100 by default) |
Learning rate | The rate at which the model adapts during learning, that is, how quickly the error tends to decrease. This parameter lets you decide whether to favor speed or accuracy. High values mean fast learning, but this could produce an ineffective model because the learning is too fast to consider all the features involved. On the other hand, a slower learning path takes more time but gives a more effective model. In other words, this parameter shrinks the contribution of each decision tree. Possible values: number between 0.000001 and 1. (0.1 by default) |
L2 regularization term on weights | Regularization lambda parameter. Ridge regularization parameter, used to reduce complexity and prevent overfitting. The cost function is altered by adding a penalty term equal to the squared magnitude of the coefficients. The penalty term lambda regularizes large coefficients by penalizing the cost function. When the parameter L2 regularization term on weights is equal to 0, no regularization is applied. Possible values: number between 0 and 1. (0 by default) |
L1 regularization term on weights | Regularization alpha parameter. Lasso regularization parameter, used to reduce complexity and prevent overfitting. The cost function is altered by adding a penalty term equal to the absolute magnitude of the coefficients. When the parameter L1 regularization term on weights is equal to 0, no regularization is applied. Possible values: number between 0 and 1. (0 by default) |
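The effect of the two penalty terms on a cost function can be illustrated generically. This is a sketch only: XGBoost applies its penalties to tree leaf weights and its exact objective is not reproduced here. The function name, the base cost and the weights are hypothetical.

```python
def regularized_cost(base_cost, weights, l1_alpha=0.0, l2_lambda=0.0):
    """Add Lasso (L1) and Ridge (L2) penalty terms to a base cost."""
    l1_penalty = l1_alpha * sum(abs(w) for w in weights)
    l2_penalty = l2_lambda * sum(w * w for w in weights)
    return base_cost + l1_penalty + l2_penalty


weights = [0.5, -2.0, 1.0]
print(regularized_cost(10.0, weights))                 # 10.0, no regularization
print(regularized_cost(10.0, weights, l2_lambda=1.0))  # 10.0 + 5.25 = 15.25
print(regularized_cost(10.0, weights, l1_alpha=1.0))   # 10.0 + 3.5 = 13.5
```

The L2 term grows quadratically, so it penalizes the large coefficient (-2.0) much more than the small ones, while the L1 term grows linearly.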
Random Forest
Parameter | Description |
---|---|
Number of decision trees | Number of decision trees to generate to make a decision. This value affects regularization. A high value implies low regularization (or generalization) and possible overfitting. On the other hand, a low value implies a high level of generalization (underfitting), with the risk of not predicting some classes. Possible values: number between 20 and 500. (100 by default) |
Split criterion on tree nodes | The split criterion to be applied on each node of the tree, that is the criterion to choose the branch to follow in the decision tree. Possible values: - Gini impurity (by default) - Entropy. |
Class weight | Regularization parameter on class balancing. For example, in a training set where one class is preponderant over all the others, selecting the value Balanced forces a re-weighting to prevent unbalanced predictions for less represented classes. If the training set is highly representative, choosing Balanced makes the model perform slightly worse than it would with the None value. If, on the other hand, the training set is not very representative and None is selected, the result is a very poor model. Possible values: - Balanced (by default) - None |
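The two split criteria measure how mixed the classes in a tree node are. The sketch below uses the standard definitions of Gini impurity and entropy; the function names and the sample node are chosen for this example only.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Fraction of the node's samples that belongs to each class."""
    counts = Counter(labels)
    total = len(labels)
    return [n / total for n in counts.values()]


def gini_impurity(labels):
    """1 - sum(p_i ** 2); 0 means the node contains a single class."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """-sum(p_i * log2(p_i)); 0 means the node contains a single class."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


node = ["a", "a", "a", "b"]
print(gini_impurity(node))  # 0.375
print(entropy(node))
```

In both cases a split is better when it produces child nodes with lower impurity than the parent.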
Logistic Regression
Parameter | Description |
---|---|
Optimization problem algorithm | The algorithm to use in the optimization problem. Possible values: - newton-cg - lbfgs (by default) - sag - saga |
Inverse of regularization strength | Penalty for misclassification. Regularization parameter on training data. It is used to prevent overfitting and underfitting. If you have a big dataset and you consider it representative, the regularization parameter value should be small to increase the regularization and force the model to be tailored on training data. On the contrary, if you have a small dataset, which may not be representative, it is better to select a large value to avoid errors. Possible values: - 0.001 - 0.01 - 0.1 (by default) - 0.2 - 0.3 - 0.4 - 0.5 - 0.6 - 0.7 - 0.8 - 0.9 - 1 - 5 |
Stop condition tolerance | Value indicating how much the error must decrease before early stopping, that is, the end of the algorithm iterations earlier than expected. Large values mean fewer iterations. Possible values: number between 0.000001 and 0.1. (0.0001 by default) |
Class weight | Regularization parameter on class balancing. For example, in a training set where one class is preponderant over all the others, selecting the value Balanced forces a re-weighting to prevent unbalanced predictions for less represented classes. If the training set is highly representative, choosing Balanced makes the model perform slightly worse than it would with the None value. If, on the other hand, the training set is not very representative and None is selected, the result is a very poor model. Possible values: - Balanced (by default) - None |
Multinomial Naive Bayes
Parameter | Description |
---|---|
Alpha regularizer | Regularization parameter, smoothing factor on term counts. Large values increase the regularization. Possible values: - 0 - 0.1 - 0.2 - 0.3 - 0.4 - 0.5 - 0.6 - 0.7 - 0.8 - 0.9 - 1 (by default) |
Complement Naive Bayes
Parameter | Description |
---|---|
Alpha regularizer | Regularization parameter, smoothing factor on term counts. Large values increase the regularization. Possible values: - 0 - 0.1 - 0.2 - 0.3 - 0.4 - 0.5 - 0.6 - 0.7 - 0.8 - 0.9 - 1 (by default) |
Normalize: penalize long documents to avoid their dominance in stats | Long documents are discarded to balance the statistics. Possible values: - on - off (by default) |
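The role of alpha as a smoothing factor on term counts can be shown with the usual additive smoothing formula. This is a generic sketch of that formula, not the Platform's implementation; the function name and the counts are illustrative.

```python
def smoothed_term_probability(term_count, total_count, vocabulary_size, alpha=1.0):
    """Estimate P(term | class) with additive smoothing."""
    return (term_count + alpha) / (total_count + alpha * vocabulary_size)


# A term never seen in the class: 0 occurrences out of 100, vocabulary of 50 terms.
print(smoothed_term_probability(0, 100, 50, alpha=1.0))  # 1 / 150
print(smoothed_term_probability(0, 100, 50, alpha=0.0))  # 0.0, no smoothing
```

With alpha set to 0 an unseen term gets probability 0, which rules out the class entirely; larger values pull all estimates toward a uniform distribution.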
F-Beta
To learn more about F-Beta, read this article.
Parameter | Description |
---|---|
Enable F-Beta optimization (tuning balance between precision and recall) | Disabled by default. If enabled it is possible to set the Target F-Beta parameter. |
Target F-Beta | A default beta value is 1.0, which is the same as the F-measure. A smaller beta value, such as 0.5, gives more weight to precision and less to recall. A larger beta value, such as 2.0, gives less weight to precision and more weight to recall in the calculation of the score. Possible values: number between 0 and 2. (1 by default) |
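The way beta shifts the balance between precision and recall follows from the standard F-beta formula, sketched below. The precision and recall values are invented for the example.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Hypothetical model with high precision and lower recall.
precision, recall = 0.9, 0.6
for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_beta(precision, recall, beta), 4))
```

For this model the score is highest at beta 0.5, because a small beta rewards its high precision, and lowest at beta 2.0, because a large beta exposes its weaker recall.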
Auto ML parameters
Machine Learning automatic self-tuning process parameters
AutoML is an algorithm that runs separately from the model being trained. AutoML selects the best combination of Features and Model Parameters to return the best performing model. The parameters on this screen pertain only to the AutoML algorithm.
Parameter | Description |
---|---|
Number of training iterations for the AutoML algorithm | Possible values: number between 20 and 100 (30 by default) |
Number of data splits for cross-validation of AutoML algorithm | Possible values: number between 2 and 10 (3 by default) |
Call back function for stopping the AutoML self-tuning process | Possible values: - Stop based on best scoring evaluation - Stop based on total time - Stop based on both best scoring evaluation and total time (by default) |
Target time deadline for the AutoML call back stop function (minutes) | Possible values: number between 15 and 120 (30 by default). Disabled if Stop based on best scoring evaluation in Call back function for stopping the AutoML self-tuning process is selected. |
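To make the Number of data splits parameter concrete, the sketch below divides a set of documents into folds and uses each fold once for validation. It is a plain k-fold illustration; whether the Platform shuffles or stratifies the data is not stated in the source, and the function name is hypothetical.

```python
def k_fold_splits(n_documents, n_splits=3):
    """Yield (training_indices, validation_indices) for each fold."""
    indices = list(range(n_documents))
    fold_sizes = [n_documents // n_splits] * n_splits
    for i in range(n_documents % n_splits):
        fold_sizes[i] += 1  # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size


for training, validation in k_fold_splits(7, n_splits=3):
    print(training, validation)
```

Each candidate configuration is trained and scored once per split, so a higher number of splits gives a steadier estimate at the cost of more training runs.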
Explainable Categorization
Generic parameters
Parameter | Description |
---|---|
Enable "onCategorizer" optimization | Enable or disable the scripting included in the onCategorizer event handler function. Enabled by default. |
Enable "strict" hierarchical mode | If enabled, this parameter ensures the consistency of strict hierarchical predictions for the categories that belong to the project taxonomy: if a child category is predicted, its whole hierarchical chain (parents and ancestors) is predicted as well. For example, if the analysis detects that the document topic is cat, the prediction of its complete hierarchical chain is also forced, which in a project taxonomy could be feline, mammals, vertebrates, animals, and not, for example, only feline and animals. Disabled by default. |
Enable "single label" mode | If this parameter is switched-on, the model considers only the prediction with the highest score and sets it as the only winner, that is, the model predicts "one and only one" category. If switched-off, the model predicts from 0 to N categories depending on rules and scripting, for example on the Filter( ) and Clean( ) functions used to manage the category scores in the onCategorizer event handler function. Disabled by default. |
Note
If Enable "onCategorizer" optimization is disabled, the Fine tuning parameters will not be available.
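The strict hierarchical mode can be pictured as expanding each predicted category with its full chain of ancestors. The sketch below uses the taxonomy from the example in the table; the data structure and the function name are assumptions of this illustration.

```python
# Hypothetical taxonomy: each category maps to its parent (None for the root).
PARENT = {
    "cat": "feline",
    "feline": "mammals",
    "mammals": "vertebrates",
    "vertebrates": "animals",
    "animals": None,
}


def with_ancestors(predicted, parent=PARENT):
    """Return the predicted categories plus their full ancestor chains."""
    result = []
    for category in predicted:
        node = category
        while node is not None:
            if node not in result:
                result.append(node)
            node = parent.get(node)
    return result


print(with_ancestors(["cat"]))
# ['cat', 'feline', 'mammals', 'vertebrates', 'animals']
```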
Rules Generation
The rules generated by the Explainable categorization engine use mainly the following attributes:
LEMMA
SYNCON
ANCESTOR
KEYWORD
PATTERN
If the analyzed text matches Knowledge Graph entries, the engine tends to generate by default combinations of LEMMA and/or SYNCON and/or ANCESTOR attribute types.
If the analyzed text doesn't match Knowledge Graph entries, KEYWORD and PATTERN attributes are also used.
Parameter | Description |
---|---|
Enable generation of syncon based rules | Enables the generation of rules based on the SYNCON attribute. By default the engine tends to generate light rules starting from LEMMA and/or SYNCON attribute types, which is why this parameter is enabled by default. If disabled, the parameter Enable generation of ancestor based rules is also disabled. |
Enable generation of ancestor based rules | Generating rules with the ANCESTOR attribute is heavier and affects the training speed, so this parameter is disabled by default. It is also disabled if the Enable generation of syncon based rules parameter is disabled. |
Max number of items in each rule | The maximum number of groups in a rule separated by an AND operator. Possible values: number between 2 and 4. (3 by default) |
Max number of rules for each taxonomy category | Maximum number of generated rules to consider for a specific category. Parameter used to reduce the verbosity of the language project. A large number of generated rules implies a better performing model but a project that is difficult to maintain. Possible values: number between 5 and 1000. (200 by default) |
Min number of annotated documents for a category, to enable rules generation | The minimum number of annotated documents in the training set that enables rules generation. Possible values: number between 2 and 1000. (5 by default) |
Max number of rules in which any single item can participate | This parameter determines the number of rules in which a single textual element (that could be a syncon or a lemma or a keyword) can be used. In other words, it limits the over-generation of rules from a single textual element. Possible values: number between 2 and 200. (40 by default) |
Fine tuning
Categorization "onCategorizer" optimization hyperparameters
These parameters are displayed only if the Enable "onCategorizer" optimization parameter in Generic parameters is switched-on.
Parameter | Description |
---|---|
Desired Clean level | The value to set in the CLEAN function (auto by default. auto = automatic) |
Default Clean level (initial value if auto is selected) | The initial value used if Desired Clean level is set to auto. (10 by default). If Enable "single label" mode in Generic parameters is switched-on, the parameter is set to 0, which means all the categories are considered. |
Desired Filter sequence | The values to set in the FILTER function (auto by default. auto = automatic) |
Default Filter sequence (initial value if auto is selected) | The initial value used if Desired Filter sequence is set to auto. (Sequence 40, 80, 90, 90, 90 by default). The algorithm starts with the default sequence, which the engine itself may then improve during the iterations. If Enable "single label" mode in Generic parameters is switched-on, the value is set to 100, which means just one class is considered. |
Enable conservative Clean | If no class exceeds the value set in Default Clean level (initial value if auto is selected), the class or classes with the highest score are considered the winners. Disabled by default. |
Max number of documents to be considered by the optimization algorithm (value of -1 is no limit) | Limits the number of documents considered by the optimization algorithm in order to improve speed. Possible values: number greater than or equal to -1. (-1 by default) |
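The behavior described for Enable conservative Clean can be sketched as a threshold with a fallback. This example is based only on the description in the table above: the actual CLEAN function may compute its threshold differently, and the function name, signature and scores below are hypothetical.

```python
def clean(scores, level=10, conservative=False):
    """Keep the categories whose score exceeds `level`.

    With conservative=True, if no category exceeds the level, fall back to
    the top-scoring category or categories instead of returning nothing.
    """
    kept = {cat: s for cat, s in scores.items() if s > level}
    if not kept and conservative and scores:
        top = max(scores.values())
        kept = {cat: s for cat, s in scores.items() if s == top}
    return kept


print(clean({"sport": 35, "finance": 12, "politics": 2}))             # sport, finance
print(clean({"sport": 8, "finance": 6}, level=10))                    # {}
print(clean({"sport": 8, "finance": 6}, level=10, conservative=True)) # {'sport': 8}
```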
F-Beta
Precision and recall balance
Parameter | Description |
---|---|
Target F-Beta | A default beta value is 1.0, which is the same as the F-measure. A smaller beta value, such as 0.5, gives more weight to precision and less to recall. A larger beta value, such as 2.0, gives less weight to precision and more weight to recall in the calculation of the score. Possible values: number between 0 and 2. (0.75 by default) |
Bootstrapped Studio Project
Rules Generation
Parameter | Description |
---|---|
Enable generation of syncon based rules | Enables the generation of rules based on the SYNCON attribute. By default the engine tends to generate light rules starting from LEMMA and/or SYNCON attribute types, which is why this parameter is enabled by default. If disabled, the parameter Enable generation of ancestor based rules is also disabled. |
Enable generation of ancestor based rules | Generating rules with the ANCESTOR attribute is heavier and affects the training speed, so this parameter is disabled by default. It is also disabled if the Enable generation of syncon based rules parameter is disabled. |
Max number of items in each rule | The maximum number of groups in a rule separated by an AND operator. Possible values: number between 1 and 3 (2 by default) |
Max number of rules for each taxonomy category | Maximum number of generated rules to consider for a specific category. Parameter used to reduce the verbosity of the language project. A large number of generated rules implies a better performing model but a project that is difficult to maintain. Possible values: number between 5 and 50. (20 by default) |
Min number of annotated documents for a category, to enable rules generation | The minimum number of annotated documents in the training set that enables rules generation. Possible values: number between 2 and 100. (5 by default) |
Max number of rules in which any single item can participate | Determines the number of rules in which a single textual element (that could be a syncon or a lemma or a keyword) can be used. In other words, it limits the over-generation of rules from a single textual element. Possible values: number between 1 and 3. (2 by default) |
Fine tuning
Parameter | Description |
---|---|
Desired Clean level | The value to set in the CLEAN function (10 by default) |
Desired Filter sequence | The values to set in the FILTER function. (Sequence 40, 80, 90 by default) |
Extraction experiments
Auto-ML Extraction
Feature space
If Automatic features selection is switched-off, the status of the following parameters can be set:
Parameter | Description |
---|---|
Word base form (Lemma) | Base form of a word (lemma) (for example run for "running" or "ran") |
Logic dependencies | Relationships and dependencies of the word (for example "subject" – "relationship type" – "object" relationships) |
Word Part-of-Speech | Part-of-speech of a word (for example noun, verb, etc.) |
Collocations | Combination of words frequently used together which have a specific meaning (e.g. "regular exercise" or "to take a risk") |
Phrases | Combination of words that together create a singular meaning (e.g. "to look after" or "on the table") |
Syncons | Conceptual meaning of a word or phrase (for example "to work out" means "to exercise") |
Known Concepts | Specific meaning of a term that is available in the Expert.ai Knowledge Graph (for example "Italy" is a specific country, "World Cup" is a specific football tournament) |
Entities | Entities (for example persons, organizations, etc.) |
Knowledge Graph relations | Attribute the hierarchical relation nodes as added meaning to the word (e.g. "dentist" is also "medical specialist", "doctor", "professional", etc.). Set on Don't use by default |
Title case words | Toggle title case as a feature in the word vector |
Upper case word | Toggle uppercase as a feature in the word vectors |
Digit words | Toggle digit as a feature in the word vector |
Mixed case words | Toggle mixed case as a feature in the word vector |
Alpha Numeric words | Toggle alpha numeric as a feature in the word vector. Set on Use by default |
Alphabetic words | Toggle alphabetic as a feature in the word vector |
Numeric words | Toggle numeric as a feature in the word vector |
Decimal number words | Toggle decimal number as a feature in the word vector |
Hyperparameters
The following parameters are CRF model specific only.
Parameter | Description |
---|---|
CRF c1 regularization coefficient | Regularization coefficient. If set to a value greater than 0, it enables the Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) method, with this value as the L1 (alpha) regularization value. Possible values: - 0 - 0.00001 - 0.0001 - 0.001 (by default) - 0.05 - 0.1 - 0.3 - 0.5 - 0.8 - 1 - 2 - 5 - 10 - 100 |
CRF c2 regularization coefficient | Regularization coefficient. If a value is set, it enables the L2 (lambda) regularization value of the l2sgd and lbfgs training algorithms. Possible values: - 0.00001 - 0.000 - 0.001 (by default) - 0.1 - 0.3 - 0. - 2 - 5 - 10 - 100 |
F-Beta
Parameter | Description |
---|---|
Enable F Beta optimization (tuning balance between precision and recall) | Disabled by default. If enabled it is possible to set the Target F Beta parameter. |
Target F Beta | A default beta value is 1.0, which is the same as the F-measure. A smaller beta value, such as 0.5, gives more weight to precision and less to recall. A larger beta value, such as 2.0, gives less weight to precision and more weight to recall in the calculation of the score. Possible values: number between 0 and 5. (1 by default) |
Auto-ML parameters
Machine Learning automatic self-tuning process parameters
AutoML is an algorithm that runs separately from the model being trained. AutoML selects the best combination of Features and Model Parameters to return the best performing model. The parameters on this screen pertain only to the AutoML algorithm.
Parameter | Description |
---|---|
Number of training iterations for the AutoML algorithm | Possible values: number between 20 and 100. (30 by default) |
Number of data splits for cross-validation of AutoML algorithm | Possible values: number between 2 and 10. (3 by default) |
Call back function for stopping the AutoML self-tuning process | Possible values: - Stop based on best scoring evaluation - Stop based on total time - Stop based on both best scoring evaluation and total time (by default) |
Target time deadline for the AutoML call back stop function (minutes) | Possible values: number between 15 and 120. (30 by default) |
Note
If you select Stop based on best scoring evaluation, the Target time deadline for the AutoML call back stop function (minutes) parameter won't be available.
Explainable Extraction
Rules generation
Support, confidences and tolerance parameters
Parameter | Description |
---|---|
Maximum number of conditions for any given rule | Parameter that determines how many conditions to use in a rule. For example, the following part of a rule: @field[ATTRIBUTE_01] operator ATTRIBUTE_02 is composed of two conditions, defined by ATTRIBUTE_01 and ATTRIBUTE_02, and of the operator. Possible values: number between 1 and 5. (3 by default) |
Enable automatic minimum support setup | Enables the automatic minimum support. If switched-on, the engine automatically sets the value. Minimum support is the number of times a rule must match in the training set in order to be generated. Switched-on by default. |
Custom minimum support threshold | Minimum support value to enter manually only if the Enable automatic minimum support setup parameter is switched off. Number greater than 2. (5 by default) |
Enable automatic minimum support (NOTE: It must be read as confidence) setup | Enables the automatic minimum confidence. If switched-on, the engine automatically sets the value. Confidence is the number of times a rule must match in the target class context in order to be generated. Switched-on by default. Note: ideally a "good" rule triggers many times (a high support level) and always and only in the target class context (a high confidence level). |
Custom minimum confidence to explore a rule | Minimum confidence value to enter manually only if the previous parameter is switched off. Possible values: number between 0.001 and 0.2. (0.05 by default) |
Minimum acceptance confidence threshold | Minimum threshold to determine that a rule is acceptable. Small values mean greater acceptance and imply a greater final project recall. Possible values: number between 0.2 and 0.95. (0.6 by default) |
Minimum confidence improvement for adding a new condition to a rule | The confidence improvement (delta) obtained by adding a condition to the rule that leads to the acceptance of the rule itself. Possible values: number between 0.01 and 0.2. (0.01 by default) |
Enable concatenation of contiguous extractions | Enable the concatenations of contiguous extractions if the extractions are composed of multiple adjacent tokens. Switched-off by default |
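Support and confidence can be illustrated with a small calculation. The sketch assumes that support is the number of times a rule matches and that confidence is the fraction of those matches that fall in the target class, which is consistent with the value ranges above but is not spelled out in the source. The function name and the sample data are hypothetical.

```python
def support_and_confidence(matches, target_class):
    """`matches` lists the annotated class of every token a rule matched
    (None when the rule matched outside any annotation)."""
    support = len(matches)
    if support == 0:
        return 0, 0.0
    hits = sum(1 for cls in matches if cls == target_class)
    return support, hits / support


# Hypothetical rule that matched five times, three of them on PERSON tokens.
matches = ["PERSON", "PERSON", None, "PERSON", "ORG"]
print(support_and_confidence(matches, "PERSON"))  # (5, 0.6)
```

Under these assumptions the rule above would reach the default Minimum acceptance confidence threshold of 0.6.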
Feature options
Parameter | Description |
---|---|
Window size (in tokens) to the left of the token being predicted | A window is a segment (set of tokens) used to find regularities in the text in order to generate rules. This parameter specifies the number of tokens to consider to the left of (preceding) the predicted token. Possible values: number between 0 and 5. (3 by default) |
Window size (in tokens) to the right of the token being predicted | This parameter specifies the number of tokens to consider to the right of (following) the predicted token. Possible values: number between 0 and 5. (3 by default) |
Minimum document frequency | Number of documents in which the feature (for example a keyword) must be present in order to generate rules. Number greater than 1 (2 by default) |
Raw word form | The word itself. Switched-on by default |
Word base form (Lemma) | Base form of a word (lemma) (for example "run" for "running" or "ran"). Switched-on by default |
Word Part-of-Speech | Part-of-speech of a word (for example noun, verb, etc.). Switched-on by default |
Syncons | Conceptual meaning of a word or phrase (for example "to work out" means "to exercise"). Switched-on by default |
Ancestors | More abstract concepts related to syncons. Switched-on by default |
Numeric words | Enable the usage of numeric tokens (for example Aspirina 500 mg, 500 is the numeric token) as a feature in order to generate rules. Switched-off by default |
Use suffix of a word | Enable the usage of token suffixes (for example -ina 500 mg) as a feature in order to generate rules. Switched-off by default |
Use prefix of a word | Enable the usage of token prefixes (for example Aspi-) as a feature in order to generate rules. Switched-off by default |
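The left and right window sizes can be pictured as slices of the token sequence around the token being predicted. The sketch below is illustrative only; the function name and the tokenization are assumptions.

```python
def window_features(tokens, index, left=3, right=3):
    """Collect the tokens around tokens[index] within the given window."""
    start = max(0, index - left)
    return {
        "token": tokens[index],
        "left": tokens[start:index],
        "right": tokens[index + 1:index + 1 + right],
    }


tokens = ["Take", "Aspirina", "500", "mg", "twice", "a", "day"]
print(window_features(tokens, 2, left=2, right=1))
# {'token': '500', 'left': ['Take', 'Aspirina'], 'right': ['mg']}
```

A window size of 0 on one side means that no context from that side is used.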
Rules selection
Options for selecting best rules
The rules generated are selected and optimized according to the following parameters.
Parameter | Description |
---|---|
Fine-tuning rules selecting only the most significant ones | If switched-off all the other parameters in the Options for selecting best rules wizard page are disabled. Switched-on by default |
Number of rules selection steps | Steps to perform in order to select the best generated rules in validation (that is, on a data subset). Possible values: number between 20 and 100. (50 by default) |
Fraction of validation split | Fraction of documents that are not used to generate rules but are used in the validation data subset. Possible values: number between 0.1 and 0.9. (0.2 by default) |
Activate rules pruning | If switched-on enables the Max number of rules to select parameter. Switched-off by default |
Max number of rules to select | Parameter that specifies the number of best rules (according to a performance evaluation algorithm) on a single class to be saved. Beyond this number the rules are pruned (removed). Enabled only if the previous parameter is switched-on. Possible values: number between 1 and 1000. (100 by default) |
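The validation split and the pruning step can be sketched as two small helpers. Both are simplified assumptions: the source does not say how documents are assigned to the validation subset or how rules are scored, so the split below simply holds out the last documents and the scores are invented.

```python
def split_documents(documents, validation_fraction=0.2):
    """Hold out the last part of the documents for validation."""
    n_validation = round(len(documents) * validation_fraction)
    cut = len(documents) - n_validation
    return documents[:cut], documents[cut:]


def prune_rules(scored_rules, max_rules=100):
    """Keep the `max_rules` best rules; `scored_rules` holds (rule, score) pairs."""
    ranked = sorted(scored_rules, key=lambda pair: pair[1], reverse=True)
    return ranked[:max_rules]


generation_docs, validation_docs = split_documents(list(range(10)), 0.2)
print(generation_docs, validation_docs)

# Hypothetical rules with made-up validation scores for one class.
rules = [("r1", 0.4), ("r2", 0.9), ("r3", 0.7)]
print(prune_rules(rules, max_rules=2))  # [('r2', 0.9), ('r3', 0.7)]
```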