
Extractions from documents

Introduction

The model produced with a thesaurus project extracts the occurrences of the concepts from documents.

Use the Extraction configuration tab of the Edit Concept panel to change the extraction settings.

Toggle extraction

Use the Extraction toggle switch to turn concept extraction on and off.
This can be useful to see how the generated model works without a certain resource.

Extraction method

In the EXTRACTION METHOD area you set the method Platform will use to determine the portions of text to extract.

Possible methods are:

  • Semantic: Platform extracts all the portions of text expressing the same meaning as the concept labels, in any inflected form. For example, for label sandglass: sandglass, hourglass, sandglasses, hourglasses.
  • Base form: Platform extracts all the inflections of the lemma of each concept label, that is, its base form as listed in a dictionary entry. For example, for label sandglass: sandglass, sandglasses.
  • Exact label: Platform extracts text portions that literally match concept labels.
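
The difference between the three methods can be illustrated with a minimal Python sketch. It is not how Platform works internally: the LEMMAS and SYNONYMS tables and the extract function are hypothetical stand-ins for the linguistic analysis the product performs, and they cover only the sandglass example above.

```python
import re

# Hypothetical toy resources (assumptions for this sketch only).
LEMMAS = {"sandglasses": "sandglass", "hourglasses": "hourglass"}
SYNONYMS = {"sandglass": {"sandglass", "hourglass"}}


def lemma(token):
    """Return the base form of a token, falling back to the token itself."""
    return LEMMAS.get(token, token)


def extract(text, label, method):
    """Return the tokens of `text` that match `label` under `method`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if method == "exact":
        # Literal match with the label.
        return [t for t in tokens if t == label]
    if method == "base_form":
        # Any inflection sharing the label's lemma.
        return [t for t in tokens if lemma(t) == lemma(label)]
    if method == "semantic":
        # Any inflection of any lemma with the same meaning as the label.
        meanings = SYNONYMS.get(lemma(label), {lemma(label)})
        return [t for t in tokens if lemma(t) in meanings]
    raise ValueError(f"unknown method: {method}")


text = "A sandglass, two hourglasses and three sandglasses."
print(extract(text, "sandglass", "exact"))
print(extract(text, "sandglass", "base_form"))
print(extract(text, "sandglass", "semantic"))
```

In this sketch each method matches a superset of the previous one: exact label is the narrowest, semantic the broadest.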

Co-occurrence constraints

In the MANDATORY CONTEXT TERMS and FORBIDDEN CONTEXT TERMS areas you can enter terms that, respectively, must or must not be present in the context chosen in the CONTEXT SETTINGS area for the extraction to take place.

For example, you may want the concept of chair to be extracted only if the term president is not also present in the same paragraph. In this case, add president to the FORBIDDEN CONTEXT TERMS column and set the context to Paragraph.

  • To add a term, select the plus button below the column header, type the term and press Enter.
  • To edit a term, hover over it and select Edit.
  • To delete a term, hover over it and select Delete.
  • To change the co-occurrence context, choose your option in CONTEXT SETTINGS.
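
The chair and president example can be sketched in Python as follows. This is only an illustration under stated assumptions: the extract_in_paragraphs function is hypothetical, paragraphs are assumed to be separated by blank lines, terms are matched as plain lowercase words, and every mandatory term is assumed to be required at once.

```python
import re


def words(text):
    """Return the set of lowercase words in a piece of text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def extract_in_paragraphs(document, label, mandatory=(), forbidden=()):
    """Return the paragraphs in which `label` would be extracted.

    Assumption: the context is the paragraph, and paragraphs are
    separated by blank lines.
    """
    hits = []
    for paragraph in document.split("\n\n"):
        present = words(paragraph)
        if label not in present:
            continue  # the concept is not expressed here
        if not all(term in present for term in mandatory):
            continue  # a mandatory context term is missing
        if any(term in present for term in forbidden):
            continue  # a forbidden context term is present
        hits.append(paragraph)
    return hits


document = (
    "The president took the chair at noon.\n\n"
    "She bought a wooden chair for the kitchen."
)
print(extract_in_paragraphs(document, "chair", forbidden=["president"]))
```

Only the second paragraph is returned, because the first one also contains the forbidden term president.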

Clause context

Not all parts of a text correspond to clauses. For example, titles such as:

Disclaimer

are not considered clauses.
Be aware that if you set Clause as the context, expressions of the concept that occur in portions of the document text not recognized as clauses may not be extracted.

Forbidden forms

With the Semantic or Base form extraction method, Platform extracts all the inflected forms of the concept labels.
If you want some forms to be ignored, add them to the FORBIDDEN FORMS column.

  • To add a form, select the plus button below the column header, type the form and press Enter.
  • To edit a form, hover over it and select Edit.
  • To delete a form, hover over it and select Delete.
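
The effect of the FORBIDDEN FORMS column can be pictured as a filter applied to the forms that would otherwise be extracted. The Python sketch below is a hypothetical illustration, not the product's implementation; in particular, the drop_forbidden_forms function and the case-insensitive comparison are assumptions.

```python
def drop_forbidden_forms(extracted_forms, forbidden_forms):
    """Return the extracted forms that are not listed as forbidden.

    Assumption: forms are compared case-insensitively.
    """
    forbidden = {form.lower() for form in forbidden_forms}
    return [form for form in extracted_forms if form.lower() not in forbidden]


# Forms matched for label sandglass with the base form method.
matched = ["sandglass", "sandglasses"]
print(drop_forbidden_forms(matched, ["sandglasses"]))
```

With sandglasses listed as a forbidden form, only the singular sandglass remains.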