Text subdivision

The text subdivision process is the part of the deep linguistic analysis that detects text structure in terms of:

  • Paragraphs
  • Sentences
  • Phrases
  • Tokens
  • Atoms

During this process, the phrase type is also determined.

A token can be:

  • A collocation, that is, a sequence of consecutive words recognized as a unit, like credit card or red carpet.
  • A single word
  • A punctuation mark

By definition, an atom is something that cannot be divided further. For single words and punctuation marks, the atom coincides with the token. A collocation token, instead, has as many atoms as it has words.
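The relationship between tokens and atoms can be illustrated with a minimal Python sketch. The `Token` class, its `is_collocation` flag, and the `atoms_of` function are hypothetical names made up for this illustration and are not part of the analysis output. The sketch also assumes that the words of a collocation are separated by whitespace.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """Hypothetical token model, for illustration only."""
    text: str
    is_collocation: bool = False


def atoms_of(token: Token) -> List[str]:
    """Return the atoms of a token.

    Assumption: the words of a collocation are separated by whitespace.
    """
    if token.is_collocation:
        # A collocation has one atom per word.
        return token.text.split()
    # A single word or a punctuation mark is its own atom.
    return [token.text]


print(atoms_of(Token("credit card", is_collocation=True)))
print(atoms_of(Token("carpet")))
print(atoms_of(Token(".")))
```

In this sketch, a collocation yields one atom per word, while a single word or punctuation mark yields a single atom identical to the token.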

As an example of text subdivision, consider this text:

``` text
Michael Jordan was one of the best basketball players of all time. Scoring was Jordan's stand-out skill, but he still holds a defensive NBA record, with eight steals in a half.
Michael Jordan was also a baseball player and an actor.
```

It is divided into two paragraphs:

``` text
1. Michael Jordan was one of the best basketball players of all time. Scoring was Jordan's stand-out skill, but he still holds a defensive NBA record, with eight steals in a half.
2. Michael Jordan was also a baseball player and an actor.
```

The first paragraph is divided into two sentences:

1. Michael Jordan was one of the best basketball players of all time.
2. Scoring was Jordan's stand-out skill, but he still holds a defensive NBA record, with eight steals in a half.

The first sentence is divided into six phrases:

1. Michael Jordan
2. was
3. one
4. of the best basketball players
5. of all time
6. .

The fourth phrase is divided into four tokens:

1. of
2. the
3. best
4. basketball players

Atoms and tokens coincide except in the case of collocations, so atoms are returned only for collocations. Here, the fourth token is divided into two atoms:

1. basketball
2. players

For each subdivision the process returns:

  • The position
  • The references to its lower-level constituent subdivisions
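As a rough illustration of the two items above, the following Python sketch models the fourth phrase of the example with its tokens and atoms. The `Span` class, its field names (`start`, `end`, `children`), and the use of character offsets and list indexes are assumptions made for this sketch. They do not reproduce the actual JSON output of the analysis.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """Hypothetical subdivision record: a position plus references to children."""
    start: int  # offset of the first character
    end: int    # offset just past the last character
    children: List[int] = field(default_factory=list)  # indexes into the lower level


text = "of the best basketball players"

# Lowest level: the two atoms of the collocation "basketball players".
atoms = [Span(12, 22), Span(23, 30)]

# Tokens: only the collocation references atoms.
tokens = [
    Span(0, 2),
    Span(3, 6),
    Span(7, 11),
    Span(12, 30, children=[0, 1]),
]

# The phrase references its four tokens.
phrase = Span(0, 30, children=[0, 1, 2, 3])


def covered(span: Span, lower_level: List[Span]) -> List[str]:
    """Resolve the child references of a span to the text they cover."""
    return [text[lower_level[i].start:lower_level[i].end] for i in span.children]


print(covered(phrase, tokens))
print(covered(tokens[3], atoms))
```

Storing indexes instead of nested objects keeps each level as a flat list, so a higher-level subdivision can be resolved to its constituents with a simple lookup.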

Text subdivision output is part of the JSON object returned by deep linguistic analysis.