Skip to content

Text subdivision

The text subdivision process is the part of the deep linguistic analysis that detects text structure in terms of:

  • Paragraphs
  • Sentences
  • Phrases
  • Tokens
  • Atoms

During this process, the phrase type is also determined.

A token can be:

  • A collocation, a sequence of consecutive words recognized as a unit, like credit card or red carpet.
  • A single word
  • A punctuation mark

By definition, an atom is something that cannot be further divided. The term is used here to indicate the single words that compose a token.

As an example of text subdivision, consider this text:

Michael Jordan was one of the best basketball players of all time. Scoring was Jordan's stand-out skill, but he still holds a defensive NBA record, with eight steals in a half.
Michael Jordan was also a baseball player and an actor.

It gets divided in two paragraphs:

1. Michael Jordan was one of the best basketball players of all time. Scoring was Jordan's stand-out skill, but he still holds a defensive NBA record, with eight steals in a half.
2. Michael Jordan was also a baseball player and an actor.

The first paragraph is divided in two sentences:

1. Michael Jordan was one of the best basketball players of all time.
2. Scoring was Jordan's stand-out skill, but he still holds a defensive NBA record, with eight steals in a half.

The first sentence is divided in six phrases:

1. Michael Jordan
2. was
3. one
4. of the best basketball players
5. of all time
6. .

The fourth phrase is divided into four tokens:

1. of
2. the
3. best
4. basketball players

Since (in the case of single-word tokens) atoms and tokens coincide, atoms are returned only for collocations, so the fourth token is divided in two atoms:

1. basketball
2. player

For each subdivision the process returns:

  • The position
  • The reference to the lower level constituent subdivisions

Text subdivision output is part of the JSON object returned by deep linguistic analysis.