Sections
Sections are parts of a document that you may want to leverage in a language model.
For example, a model that categorizes email messages should give more importance to the text of the subject than to the body or the attachments, and a model that extracts information from contracts works better if it can limit the search for the covenants to the section of the contract dedicated to them.
If a document is represented by its plain text to allow a language model to analyze it, the visual features of sections are missing, so section boundaries are used instead. Section boundaries indicate which text belongs to which section in terms of positions: the position where the section starts and the position where it ends. Hence, when a document gets analyzed with a model that uses sections, the document's text must be accompanied by metadata indicating sections' boundaries. This allows categorization and extraction rules to determine the section a given portion of text belongs to and act accordingly.
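As an illustration of boundaries expressed as positions, here is a minimal Python sketch. The `SectionBoundary` class, its field names, the exclusive end offset and the section names `SUBJECT` and `BODY` are all assumptions made for the example. They are not the Platform's actual annotation format.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SectionBoundary:
    """Hypothetical section metadata: a name plus character offsets."""
    name: str   # illustrative section name, e.g. "SUBJECT"
    start: int  # offset of the first character of the section
    end: int    # offset just past the last character (exclusive, by assumption)


def section_at(position: int, boundaries: Sequence[SectionBoundary]) -> Optional[str]:
    """Return the name of the section containing the position, or None."""
    for boundary in boundaries:
        if boundary.start <= position < boundary.end:
            return boundary.name
    return None


# Plain text of an email: the subject line, a newline, then the body.
text = "Invoice overdue\nPlease settle the attached invoice by Friday."
subject_end = text.index("\n")
boundaries = [
    SectionBoundary("SUBJECT", 0, subject_end),
    SectionBoundary("BODY", subject_end + 1, len(text)),
]

# A rule that matched "overdue" can now ask which section the match is in.
print(section_at(text.index("overdue"), boundaries))
```

A position that falls outside every annotated span, such as the newline separating the two sections here, yields `None`, so a rule can treat unannotated text separately.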
The scope of categorization or extraction rules can be limited to sections only manually, in Studio. No model generated automatically with the Platform authoring application can use sections, whether it is a Machine Learning, bootstrapped Studio categorization, explainable categorization, explainable extraction or thesaurus model [1].
Studio models can be loaded in a Platform project, offline or interactively, to test their quality with special experiments that simply use the model instead of generating one. For this reason, Platform projects (categorization, extraction and thesaurus) allow library documents to have section annotations, that is, annotations of sections' boundaries. During an experiment, annotated boundaries are passed to the model and section-constrained rules can be triggered accordingly.
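To make the idea of a section-constrained rule concrete, the sketch below confines a keyword search to one section's span. It is only an illustration in Python: the `find_in_section` helper and the dictionary of offsets are hypothetical, and actual Studio rules are written in Studio's own rule language, not like this.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical metadata: section name -> (start, end) offsets, end exclusive.
Sections = Dict[str, Tuple[int, int]]


def find_in_section(pattern: str, text: str, sections: Sections, section: str) -> List[str]:
    """Return the matches of pattern that fall inside the named section."""
    if section not in sections:
        return []  # no boundaries for this section: the rule cannot fire
    start, end = sections[section]
    compiled = re.compile(pattern, re.IGNORECASE)
    # pos/endpos confine the search to the section's span of the text
    return [match.group(0) for match in compiled.finditer(text, start, end)]


text = "Refund request\nI do not want a refund, only a replacement."
newline = text.index("\n")
sections: Sections = {
    "SUBJECT": (0, newline),
    "BODY": (newline + 1, len(text)),
}

# The same keyword is evaluated separately per section.
print(find_in_section("refund", text, sections, "SUBJECT"))
print(find_in_section("replacement", text, sections, "SUBJECT"))
```

Without the boundaries, the helper has no span to search and returns nothing, which mirrors why the text must be accompanied by section metadata.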
Documents loaded in a project can have section annotations. When documents are uploaded from an exchange archive, the archive may contain annotation files and, among the annotations, there may be section annotations.
As an alternative or complement, the sections of a document can be annotated interactively in the Platform GUI. To support this functionality, categorization, extraction and thesaurus projects allow defining the sections the user can choose from. Follow the links below to find out how to define sections as project resources.
How to define sections in:
Note
The upload wizard has options to manage the harmonization of the sections annotated in the archive documents and those defined at project level.
When models that leverage sections are eventually used in an NL Flow workflow, the text to analyze must again be accompanied by section boundaries.
-
[1] However, bootstrapped Studio categorization models, explainable categorization models, explainable extraction models and thesaurus models can be edited with Studio to manually add support for sections, then imported back into a Platform project to perform quality tests that can leverage section information.