Introduction to the DIS object
The DIS pre-defined object and its methods
The DIS pre-defined object gives access to the disambiguation results and also allows you to tag or untag text subdivisions.
The DIS object can be used in the following functions, which are executed after disambiguation:
The functionality of the DIS object is exposed through its methods, which fall into these groups:
- Count text subdivisions
- Get the text of the whole document or that of text subdivisions
- Get objects corresponding to text subdivisions to explore their properties
- Get the index of the text subdivision of a given kind that contains the character at a given position in the document text
- Tag and untag text subdivisions
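The position-to-index lookup mentioned in the list can be pictured with a small, language-neutral sketch. It does not call the DIS object: the spans come from a plain regular-expression split used as a stand-in for real subdivisions, and the names `spans`, `starts` and `index_at` are hypothetical. Returning `None` for a position that falls between subdivisions is also an assumption of this sketch, not documented behavior.

```python
import re
from bisect import bisect_right

text = "Michael Jordan was one of the best basketball players of all time."

# Stand-in subdivisions: (start, end) character spans, end exclusive.
# A simple word/punctuation split, NOT the output of real disambiguation.
spans = [m.span() for m in re.finditer(r"\w+|[^\w\s]", text)]
starts = [start for start, _ in spans]


def index_at(position):
    """Return the index of the span containing `position`, or None."""
    # Find the last span that starts at or before the position.
    i = bisect_right(starts, position) - 1
    if i >= 0 and position < spans[i][1]:
        return i
    return None  # position is in whitespace or outside the text


print(index_at(9))   # character 9 is inside "Jordan"
print(index_at(7))   # character 7 is the space after "Michael"
```

Because the span starts are sorted, a binary search keeps each lookup logarithmic in the number of subdivisions.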
Text subdivisions: tokens and atoms
The methods of the DIS object rely on the text subdivisions, of varying granularity, that disambiguation creates.
At the token level, subdivisions also include sub-tokens called "atoms". Disambiguation lists these atoms immediately after the token they belong to, so it is crucial to account for them when iterating through tokens.
For example, given this input text:
Michael Jordan was one of the best basketball players of all time.
disambiguation identifies these 15 units as either tokens or atoms:
Index | Text | Sub-token (atom)? |
---|---|---|
0 | Michael Jordan | No |
1 | Michael | Yes |
2 | Jordan | Yes |
3 | was | No |
4 | one | No |
5 | of | No |
6 | the | No |
7 | best | No |
8 | basketball players | No |
9 | basketball | Yes |
10 | players | Yes |
11 | of | No |
12 | all | No |
13 | time | No |
14 | . | No |
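To make the iteration caveat concrete, here is a minimal sketch that walks the 15 units from the table and attaches each atom to the token listed just before it. It works on plain data copied from the table rather than on the DIS object, and `units` and `group_atoms` are hypothetical names introduced only for illustration.

```python
# Units from the table as (text, is_atom) pairs, in the listed order.
units = [
    ("Michael Jordan", False),
    ("Michael", True),
    ("Jordan", True),
    ("was", False),
    ("one", False),
    ("of", False),
    ("the", False),
    ("best", False),
    ("basketball players", False),
    ("basketball", True),
    ("players", True),
    ("of", False),
    ("all", False),
    ("time", False),
    (".", False),
]


def group_atoms(units):
    """Attach each atom to the closest preceding token."""
    grouped = []
    for index, (text, is_atom) in enumerate(units):
        if is_atom:
            if not grouped:
                raise ValueError("atom without a preceding token")
            grouped[-1]["atoms"].append((index, text))
        else:
            grouped.append({"index": index, "text": text, "atoms": []})
    return grouped


tokens = group_atoms(units)
for token in tokens:
    print(token["index"], token["text"], [text for _, text in token["atoms"]])
```

Of the 15 units, 11 are tokens and 4 are atoms, so a loop that treats every index as a separate token would process "Michael Jordan" and "basketball players" and then process their parts again.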