Basic extraction configuration
Introduction
Project models can identify and extract the mentions of the taxonomy concepts from documents.
Use the Extraction tab of the Edit concept panel and its toolbar to change the basic extraction settings for a concept.
Advanced extraction settings can be managed in the Advanced extraction tab.
Toggle extraction
Use the Extraction toggle on the panel toolbar to turn extraction on and off.
If extraction is disabled, the concept label in the taxonomy tree is stricken through.
Extraction method
The extraction method determines how project models use a concept's labels to detect expressions of the concept in a text and extract them.
Possible values are:
- Semantic: all the portions of text expressing the same meaning as the concept labels, in any inflected form. For example, if the label is sandglass, the model will extract sandglass, hourglass, sandglasses, hourglasses.
- Fuzzy matching: valid for collocations. The method extracts the concept even if the words of the collocation do not appear in the same order as in the label, as long as they occur within a certain maximum distance, which cannot be set by the user.
- Base form: the label is considered as a lemma (that is, the base form or dictionary entry for a term) and all its inflections are extracted. For example, if the label is sandglass, the model will extract sandglass and sandglasses.
- Exact label: if the label is lowercase, the model will extract copies of the label written with any case. For example, if the label is triumph, the following can be extracted:
  - triumph
  - Triumph
  - TRIUMPH
  - tRiUmPh
  - ...
  If the label contains one or more uppercase letters, the model will only extract exact copies of the label.
- Exact label same case: the model will extract only exact copies of the label written in the same case.
- Exact label case insensitive: the model will extract copies of the label written with any case.
The default value is the one set at project level.
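The three Exact label variants differ only in how they treat letter case. The following Python sketch illustrates the rules described above; it is not the Platform's implementation, and the function name `label_matches` and the method identifiers are hypothetical names chosen for this example.

```python
def label_matches(label: str, candidate: str, method: str = "exact_label") -> bool:
    """Illustrative only: decide whether `candidate` counts as a copy of `label`.

    The method names are hypothetical identifiers for the three
    Exact label variants described in this section.
    """
    if method == "exact_label_same_case":
        # Only exact copies, same case.
        return candidate == label
    if method == "exact_label_case_insensitive":
        # Copies written with any case.
        return candidate.casefold() == label.casefold()
    if method == "exact_label":
        # A label with at least one uppercase letter matches exact copies only;
        # an all-lowercase label matches copies written with any case.
        if any(ch.isupper() for ch in label):
            return candidate == label
        return candidate.casefold() == label.casefold()
    raise ValueError(f"unknown method: {method!r}")
```

With this sketch, `label_matches("triumph", "tRiUmPh")` is true, while `label_matches("Triumph", "triumph")` is false because the label contains an uppercase letter.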
Forbidden forms
When the extraction method is Semantic or Base form (see above), the model extracts inflected forms of the concept labels.
If you want some forms to be ignored, add them to the FORBIDDEN FORMS column.
- To add a forbidden form:
  - Select the plus button beside Forbidden forms, type the form and press Enter.
  - To the right of the form you see Case: sensitive or Case: insensitive, which corresponds to the type of match between the form as you typed it and the text of documents. This initially reflects the corresponding project setting, but you can change it by choosing from the dropdown menu.
  - If the project is monolingual, you will see the label for the only language on the far right of the forbidden form box; otherwise you will see the label for the preferred language initially, but you can change the language from the dropdown menu.
- To change a form, just edit it.
- To delete a form, hover over it and select the X icon.
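The effect of forbidden forms can be pictured as a filter applied to the forms a model would otherwise extract. The sketch below is a simplified illustration under stated assumptions: `ForbiddenForm`, `is_forbidden` and `filter_forms` are hypothetical names, and the per-form language setting is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForbiddenForm:
    """Hypothetical record for one forbidden form and its Case setting."""
    text: str
    case_sensitive: bool = False


def is_forbidden(surface: str, forbidden: list[ForbiddenForm]) -> bool:
    """Return True if the surface form equals any forbidden form."""
    for form in forbidden:
        if form.case_sensitive:
            if surface == form.text:
                return True
        elif surface.casefold() == form.text.casefold():
            return True
    return False


def filter_forms(surfaces: list[str], forbidden: list[ForbiddenForm]) -> list[str]:
    """Keep only the surface forms that are not forbidden."""
    return [s for s in surfaces if not is_forbidden(s, forbidden)]
```

For example, with the label sandglass and a case-insensitive forbidden form sandglasses, the candidates sandglass, sandglasses and Sandglasses would be reduced to sandglass alone.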
Extraction context
In Platform terminology, a context is a subdivision of the text, for example a paragraph or a sentence.
By establishing the context for concept extraction, you can specify context constraints that confirm or cancel the extraction. These are conditions that, when an expression of a concept is identified in the text, must be met or must not be met within the specified context for the extraction to actually take place.
To specify the context for these conditions, make a choice from the options next to Context settings. The default choice is the one set at the project level. Possible choices are:
- Sentence
- Sentence*2: two consecutive sentences, that is, the sentence in which the expression of the concept was identified plus any preceding or following sentence
- Clause: be careful when choosing this value, because the context also determines from which parts of the text the concept is extracted. The other choices cover all of the text, because every part of the text is always considered to be inside a sentence and a paragraph, but some parts of the text are not considered clauses. For example, a heading like:
  Disclaimer
  may not be considered a clause, so be aware that there may be portions of the document text in which extraction will not take place.
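To make the Sentence*2 option more concrete, the following sketch builds the two-sentence windows around the sentence where a concept expression was found. It reflects one possible reading of the description above (the hit sentence paired with its preceding or its following neighbor); the function name `sentence_pair_contexts` is hypothetical and the Platform's actual behavior may differ.

```python
def sentence_pair_contexts(sentences: list[str], hit_index: int) -> list[list[str]]:
    """Return the two-sentence windows that contain the sentence at hit_index.

    Assumption: a Sentence*2 context is the hit sentence together with
    either the preceding or the following sentence.
    """
    if not 0 <= hit_index < len(sentences):
        raise ValueError("hit_index is out of range")

    contexts = []
    if hit_index > 0:
        # Preceding sentence + hit sentence.
        contexts.append(sentences[hit_index - 1:hit_index + 1])
    if hit_index < len(sentences) - 1:
        # Hit sentence + following sentence.
        contexts.append(sentences[hit_index:hit_index + 2])
    if not contexts:
        # Single-sentence text: the hit sentence is the whole context.
        contexts.append([sentences[hit_index]])
    return contexts
```

A hit in the middle sentence of a three-sentence text yields two candidate windows, while a hit in the first sentence yields only the window with the following sentence.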
Forbidden context terms and Mandatory context terms
Forbidden context terms are optional terms that, if specified, must not occur together with expressions of the concept in the same extraction context in order for the extraction, or multiple extractions, to take place.
For example, you may want the concept of chair to be extracted only if the term president is not also present in the same paragraph.
Conversely, Mandatory context terms are optional terms that, if specified, must occur together with expressions of the concept in the same extraction context.
- To add a term:
  - Select the plus button beside Forbidden context terms or Mandatory context terms, type the term and press Enter.
  - To the right of the term you see Case: default, Case: sensitive or Case: insensitive, which corresponds to the type of match between the term as you typed it and the text of documents. Case: default means that if the term is written in lowercase it will match in a case-insensitive way, while if it contains at least one uppercase letter it will match in a case-sensitive way. This initially reflects the corresponding project setting, but you can change it by choosing from the dropdown menu.
  - If the project is monolingual, you will see the label for the only language on the far right of the term box; otherwise you will see the label for the preferred language initially, but you can change the language from the dropdown menu.
- To change a term, just edit it.
- To delete a term, hover over it and select the X icon.
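The interaction between the Case setting and the two kinds of context terms can be sketched as follows. This is an illustration, not the Platform's logic: the names `term_matches` and `extraction_allowed` are hypothetical, terms are matched as whole words with a regular expression, and the sketch assumes that any forbidden term blocks the extraction and that all mandatory terms must be present, neither of which is stated explicitly above.

```python
import re


def term_matches(term: str, context: str, case: str = "default") -> bool:
    """Check whether `term` occurs in `context` as a whole word."""
    if case == "default":
        # Lowercase term: case-insensitive; any uppercase letter: case-sensitive.
        sensitive = any(ch.isupper() for ch in term)
    elif case == "sensitive":
        sensitive = True
    elif case == "insensitive":
        sensitive = False
    else:
        raise ValueError(f"unknown case setting: {case!r}")

    flags = 0 if sensitive else re.IGNORECASE
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return re.search(pattern, context, flags) is not None


def extraction_allowed(context, forbidden=(), mandatory=()):
    """Decide whether an extraction in `context` passes the term constraints.

    `forbidden` and `mandatory` are iterables of (term, case) pairs.
    Assumption: one forbidden term is enough to cancel the extraction,
    and every mandatory term must occur.
    """
    if any(term_matches(t, context, c) for t, c in forbidden):
        return False
    return all(term_matches(t, context, c) for t, c in mandatory)
```

Following the chair example, `extraction_allowed("The chair thanked the president.", forbidden=[("president", "default")])` returns `False`, whereas the same call on "The chair was made of oak." returns `True`.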