Sections
Introduction
Some types of documents have a structure consisting of multiple sections.
For example:
- E-mail messages:
    - Subject
    - Body
    - Sender
    - "To" recipients
    - "CC" recipients
    - Attachments
- Newspaper articles:
    - Title
    - Byline
    - Lead
    - Body
- Scientific papers:
    - Title
    - Abstract
    - Keywords
    - Main text
    - References
Taking sections into account in categorization and extraction projects may be important or even necessary. For example, an extraction project may require the extraction of data only from a given section, while a categorizer could perform better if more importance is given to the title text.
Expert.ai languages allow you to declare the expected sections, use them as the scope for rules, and give each one a weight reflecting its importance, which the scoring algorithm then applies automatically.
However, the disambiguator analyzes plain text, so when the text of a structured document is copied and pasted, unstructured text will be returned with no indication of where sections begin and end. It would seem that the sections are totally lost in the text extraction.
In reality, there are ways to preserve and provide sections information to the disambiguator.
Sections can be detected when the original document (an e-mail message, a PDF file, an XML file, etc.) is processed to obtain its text. If the document processor is programmed for this purpose, it can also locate the sections and output their boundaries as side-by-side information in a format the disambiguator can understand.
When the plain text is coupled with this information, the disambiguator is able to recognize portions of text as belonging to one section or another.
Side-by-side information is used at runtime by the text intelligence engine in the production environment, but, before that, it can be used in the development environment to set up and test sections management in the project.
In the same environment, plain text test files can be manually annotated to indicate the start and end of sections. These annotations are also stored as side-by-side information in special project files.
Sections information is provided from the outside, so sections can be considered as predefined areas of text. Segments, on the other hand, are dynamically created "in memory" at runtime by specific instructions of expert.ai languages, based on features of the plain text alone.
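The coupling of plain text with section boundaries can be pictured with a small Python sketch. The actual side-by-side format is not described in this section, so the `(name, start, end)` character offsets and the `split_by_section` helper below are purely hypothetical illustrations, not the engine's real representation.

```python
from typing import Dict, List, Tuple

# Plain text as it would reach the disambiguator: no structural markers.
text = "Quarterly results\nRevenue grew in the third quarter."

# Hypothetical side-by-side information: section name plus character offsets.
boundaries: List[Tuple[str, int, int]] = [
    ("TITLE", 0, 17),
    ("BODY", 18, len(text)),
]


def split_by_section(text: str, boundaries: List[Tuple[str, int, int]]) -> Dict[str, str]:
    """Return the portion of plain text that belongs to each section."""
    return {name: text[start:end] for name, start, end in boundaries}


sections = split_by_section(text, boundaries)
```

The text itself stays untouched; only the accompanying offsets say which portion belongs to which section.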
Declaration
In order to be used, sections must be declared in a rules file.
The syntax is:
SECTIONS
{
    declaration[,
    declaration,
    ...]
}
where declaration is defined as:
    name[(options)]
name must match the section name specified in the side-by-side information and must be preceded by the at sign (@). Multiple options are separated by commas.
For example:
SECTIONS
{
    @BODY(STANDARD,1SCORE),
    @TITLE(2SCORE)
}
There are several options:
- STANDARD qualifier
- #SCORE multiplier
- #SHORT:#LONG multiplier
- #SHORT:#LONG:#INITIAL:#SHORT:#LONG multiplier
Note
The STANDARD qualifier can be combined with any multiplier, as seen in the example above. In that case the multiplier applies to the scores of rules triggered in the default section.
STANDARD qualifier
The STANDARD qualifier marks the default section. This section is the implicit "higher level scope" for rules whose scope does not reference a section.
For example, this rule:
SCOPE SENTENCE
{
    DOMAIN(dom1)
    {
        TYPE(NPH, NPR)
    }
}
has SENTENCE as its declared scope, therefore only sentences in the default section will be considered when evaluating the rule.
Only one section can have the STANDARD qualifier.
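As a rough illustration of the declaration syntax and of the one-STANDARD constraint, the following Python sketch reads a SECTIONS block and checks it. The function names and the regular expression are assumptions made for this example; this is not how the expert.ai compiler parses rules files.

```python
import re
from typing import Dict, List, Optional

# Matches "@NAME" optionally followed by "(option, option, ...)".
DECLARATION = re.compile(r"@(\w+)(?:\(([^)]*)\))?")


def parse_sections(block: str) -> Dict[str, List[str]]:
    """Map each declared section name to its raw option strings."""
    sections: Dict[str, List[str]] = {}
    for name, options in DECLARATION.findall(block):
        sections[name] = [opt.strip() for opt in options.split(",") if opt.strip()]
    return sections


def standard_section(sections: Dict[str, List[str]]) -> Optional[str]:
    """Return the name of the STANDARD section, rejecting duplicates."""
    marked = [name for name, opts in sections.items() if "STANDARD" in opts]
    if len(marked) > 1:
        raise ValueError("only one section can be STANDARD: " + ", ".join(marked))
    return marked[0] if marked else None


declared = parse_sections("""
SECTIONS
{
    @BODY(STANDARD,1SCORE),
    @TITLE(2SCORE)
}
""")
```

Here `standard_section(declared)` picks out BODY as the default section, while a declaration marking two sections as STANDARD raises an error.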
SCORE multiplier
The #SCORE multiplier option has this syntax:
    #SCORE
where # can be 0 or any positive integer. The score multiplier affects only categorization rules.
Whenever a categorization rule is triggered by the text of a section with this option set, the rule's score is multiplied by the value of #.
The default value is 1, so if no score multiplier option is specified, the rule's score is not changed.
When # is 0, the rule's score is multiplied by 0 and becomes 0, as if the rule had not been triggered.
In this way the text of the section is excluded from the categorization, as if it did not exist.
Values of # greater than 1 give a "boost" to the score, thus attributing more relevance to rules' hits when they occur in the section. They are usually specified for heading sections such as titles.
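The arithmetic described above can be sketched in a few lines of Python. The function name and the sample rule score of 10 are hypothetical and serve only to show the effect of the multiplier; the engine's real scoring algorithm involves more than this single step.

```python
def apply_section_multiplier(rule_score: float, multiplier: int = 1) -> float:
    """Scale a categorization rule's score by the section's #SCORE value."""
    if multiplier < 0:
        raise ValueError("the multiplier must be 0 or a positive integer")
    return rule_score * multiplier


# A hypothetical rule score of 10 hitting in differently configured sections.
unchanged = apply_section_multiplier(10)       # no option set: default of 1
boosted = apply_section_multiplier(10, 2)      # 2SCORE: the hit counts double
neutralized = apply_section_multiplier(10, 0)  # 0SCORE: the hit contributes nothing
```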
SHORT:LONG multiplier
The #SHORT:#LONG multiplier has this syntax:
    #SHORT:#LONG
where # can be 0 or any positive integer. The score multiplier affects only categorization rules.
Note
For a description of #, see the SCORE paragraph.
Using these two attributes means making a distinction between long and short documents. In this case, two multiplicative factors are defined, one for each document type.
Note
A document is considered long if it contains more than 100 words, short if it contains 100 words or fewer.
Consider this example:
SECTIONS
{
    @BODY(STANDARD,1SCORE),
    @TITLE(2SCORE),
    @MYSECTION(3SHORT:2LONG)
}
In this case, any categorization rule triggered within the MYSECTION section will have its score multiplied by 3 if the document is short (100 words or fewer), or by 2 if it is long (more than 100 words).
Note
This configuration is useful when the length of the text in a section varies. Length can affect the score assigned to a category, which tends to be lower for a shorter text (fewer words and fewer rule hits) and higher for a longer text (more words and more rule hits). Defining two different score attributes aims to restore some balance in the final categorization score.
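The choice between the two factors can be sketched as follows. The function and constant names are hypothetical, and the sketch assumes the caller already knows the word count; the 100-word threshold is the one stated in the note above.

```python
LONG_DOCUMENT_THRESHOLD = 100  # words; anything above this counts as "long"


def short_long_multiplier(word_count: int, short: int, long: int) -> int:
    """Pick the SHORT or LONG factor according to the word count."""
    return long if word_count > LONG_DOCUMENT_THRESHOLD else short


# Factors taken from the 3SHORT:2LONG example above.
at_threshold = short_long_multiplier(100, short=3, long=2)    # still short
over_threshold = short_long_multiplier(101, short=3, long=2)  # long
```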
SHORT:LONG:INITIAL:SHORT:LONG multiplier
The #SHORT:#LONG:#INITIAL:#SHORT:#LONG multiplier has this syntax:
    #SHORT:#LONG:#INITIAL:#SHORT:#LONG
where # can be 0 or any positive integer. The score multiplier affects only categorization rules.
Note
For a description of #, see the SCORE paragraph.
Unlike the #SHORT:#LONG multiplier described above, this multiplier makes two distinctions:
- The first between "long" and "short" documents, already described for the #SHORT:#LONG multiplier.
- The second between text positioned at the beginning of a document and text further down.
For this reason, four distinct multiplicative factors are defined, one for each SHORT or LONG attribute.
The fifth attribute, INITIAL, defines how many sentences, counted from the beginning of the section, make up the initial part.
Consider this example:
SECTIONS
{
    @BODY(STANDARD,1SCORE),
    @TITLE(2SCORE),
    @MYSECTION(3SHORT:2LONG:3INITIAL:4SHORT:4LONG)
}
In the example above, the MYSECTION section is assigned a multiplicative factor of 3 for short documents and 2 for long documents. This means that any categorization rule triggered within the MYSECTION section will have its score multiplied by 3 if the document contains 100 words or fewer, or by 2 if it contains more than 100 words. The multiplicative factor assigned to the first three sentences of the section (3INITIAL) is 4 for both short and long documents.
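One way to read the five attributes is sketched below in Python. The class and field names are hypothetical. The sketch assumes that the pair after INITIAL applies to the initial sentences and the first pair applies to the rest of the section, which is how the example above reads, and that sentences are counted from 0.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InitialMultiplier:
    """Factors from a #SHORT:#LONG:#INITIAL:#SHORT:#LONG option."""
    short: int              # rest of the section, short document
    long: int               # rest of the section, long document
    initial_sentences: int  # how many leading sentences count as "initial"
    initial_short: int      # initial sentences, short document
    initial_long: int       # initial sentences, long document

    def factor(self, word_count: int, sentence_index: int) -> int:
        """Pick the factor for a hit in the given sentence (0-based index)."""
        is_long = word_count > 100
        if sentence_index < self.initial_sentences:
            return self.initial_long if is_long else self.initial_short
        return self.long if is_long else self.short


# 3SHORT:2LONG:3INITIAL:4SHORT:4LONG from the example above.
my_section = InitialMultiplier(short=3, long=2, initial_sentences=3,
                               initial_short=4, initial_long=4)
```

With these values, a hit in any of the first three sentences is multiplied by 4 whatever the document length, while later hits fall back to 3 (short) or 2 (long).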
Sections as rules' scope
Sections can be specified as the scope of categorization and extraction rules. Rules with a section scope are triggered only by the text of the specified section.
For example, this rule:
SCOPE SECTION(TITLE)
{
    IDENTIFY(VIP)
    {
        @NAME[TYPE(NPH)]
    }
}
matches and extracts people's names from the TITLE section only, and places them in the NAME field of VIP records.
Implicit section
Internally, any expert.ai-based text intelligence engine requires sections, meaning that it expects that any given text will always belong to some section.
On the other hand, original documents may not be structured and/or categorization and extraction projects may not need sections. Therefore users will not be required to declare and use sections, if they are not needed.
The solution to this apparent contradiction is the implicit section.
When a user decides to omit the sections declaration, the engine works as if a BODY section with the STANDARD qualifier had been declared and, hence, as if all the text were included in the BODY section.
In other words, if sections are not required for a project, they can be ignored and the project will work as expected. However, if they are used:
- If a BODY section is declared, this declaration will override the implicit declaration.
- If a BODY section is not declared because the documents do not contain one, any text outside sections will be considered as part of the implicit BODY section.
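The fallback to the implicit section can be pictured with a short Python sketch. The `(name, start, end)` offsets are a hypothetical stand-in for the side-by-side information, and `section_at` is an illustrative helper, not an engine API.

```python
from typing import List, Tuple

IMPLICIT_SECTION = "BODY"


def section_at(offset: int, boundaries: List[Tuple[str, int, int]]) -> str:
    """Return the section containing a character offset.

    Text that falls outside every known section is attributed to the
    implicit BODY section.
    """
    for name, start, end in boundaries:
        if start <= offset < end:
            return name
    return IMPLICIT_SECTION


# No sections at all: everything belongs to the implicit BODY section.
no_sections = section_at(3, [])

# Only a TITLE section is known: text after it still lands in BODY.
in_title = section_at(3, [("TITLE", 0, 10)])
outside = section_at(20, [("TITLE", 0, 10)])
```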