Document exchange archives
Archive structure
Platform's authoring application uses ZIP archives to exchange document texts and annotations. When you download documents, the application packs their texts[1] together with any annotations into a ZIP file, called the exchange archive, that you can keep as a backup or upload to another project on the same or a different installation of the application.
The internal structure of an exchange archive is as follows:
test
ann
The test folder contains UTF-8 encoded plain text files, each containing the text of a document.
The ann folder is present if there is at least one annotation for at least one document, and it contains the annotation files.
The text file of a document and its annotation file are matched by file name, excluding the extension. For example:
test
    doc1.txt
    doc2.txt
    doc3.txt
    ...
ann
    doc2.ann
    ...
In this case doc2.ann is the annotation file for document doc2.txt. As you can see, there are no annotation files for documents doc1.txt and doc3.txt.
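The matching rule above can be sketched in Python with the standard zipfile module. This is a minimal illustration, not the application's own code: the function name pair_documents is hypothetical, and the sketch assumes that the folder directly containing each file is named test or ann, whether or not those folders sit at the root of the archive.

```python
import io
import zipfile
from pathlib import PurePosixPath


def pair_documents(archive):
    """Map each document stem to (text member, annotation member or None)."""
    texts, anns = {}, {}
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            path = PurePosixPath(name)
            if name.endswith("/") or len(path.parts) < 2:
                continue  # skip directory entries and files outside any folder
            folder = path.parts[-2]  # assumption: parent folder is test or ann
            if folder == "test":
                texts[path.stem] = name
            elif folder == "ann":
                anns[path.stem] = name
    # A document without an annotation file is paired with None.
    return {stem: (texts[stem], anns.get(stem)) for stem in sorted(texts)}


# Build a small in-memory archive that mirrors the example above.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    for stem in ("doc1", "doc2", "doc3"):
        zf.writestr(f"test/{stem}.txt", f"Text of {stem}")
    zf.writestr("ann/doc2.ann", "C1\t15000000\n")
buffer.seek(0)

pairs = pair_documents(buffer)
for stem, (text_name, ann_name) in pairs.items():
    print(stem, text_name, ann_name)
```

Only doc2 is paired with an annotation file; doc1 and doc3 are paired with None.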
Note
You don't have to worry about the ZIP archive structure when exporting documents: the export procedure structures the archive automatically.
The authoring application can also import documents from ZIP files that have a different structure than exchange archives. In this case the archive is simply treated as a container of files, which can be in any of the formats supported by the upload procedure.
Annotation files
An annotation file is a UTF-8 encoded text file in brat standoff format.
It contains all the annotations of a document, which can be:
- Expected categorization results, i.e. categories
- Expected results of information extraction, i.e. class values
- Sections of text such as titles
Category annotations
Category annotations are written like this:
C1 15000000
C2 20001117
The number after C is the sequential number of the annotation. The code after the tab is the category ID.
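As an illustration, category lines could be read with a few lines of Python. The function name parse_category is hypothetical. The sketch splits on any whitespace, so it accepts a tab or a space after the annotation number, and it keeps the category ID as a string because nothing here says that IDs are always numeric.

```python
def parse_category(line):
    """Return (sequential number, category ID), or None for other lines."""
    fields = line.split()
    if len(fields) != 2 or not fields[0].startswith("C"):
        return None
    number = fields[0][1:]
    if not number.isdigit():
        return None
    return int(number), fields[1]


sample = "C1\t15000000\nC2\t20001117\n"
categories = [parse_category(line) for line in sample.splitlines()]
print(categories)
```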
Extraction annotations
Extraction annotations are written like this:
T1 Ingredients.Legumes 116 123 lentils
The number after T is the sequential number of the annotation.
Ingredients is the name of the group, Legumes is the name of the class.
116 is the zero-based position of the first character of the value in the text, 123 is the position of the first character after the value. lentils is the class value to extract.
For ungrouped classes, the name of the group must be replaced with the name of the class.
If the name of a group or of a class is composed exclusively of digits, it must be prefixed with the constant string X_.
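The offsets describe a half-open span: the end position is the first character after the value, so in the example 123 - 116 = 7, the length of lentils. The following Python sketch checks that property and applies the X_ prefix rule. All function names are hypothetical, the document text is made up (116 filler characters followed by the value), and fields are split on any whitespace as an assumption about the separators.

```python
def encode_name(name):
    """Prefix an all-digit group or class name with the constant string X_."""
    return "X_" + name if name.isascii() and name.isdigit() else name


def parse_extraction(line):
    """Split an extraction line into (number, label, start, end, value)."""
    # At most four splits, so a value containing spaces stays in one piece.
    marker, label, start, end, value = line.rstrip("\r\n").split(None, 4)
    if not marker.startswith("T"):
        raise ValueError(f"not a T annotation: {line!r}")
    return int(marker[1:]), label, int(start), int(end), value


def matches_text(text, start, end, value):
    """True when the half-open span [start, end) of text equals value."""
    return text[start:end] == value


# Made-up document text: 116 filler characters, then the value.
text = "x" * 116 + "lentils, rinsed and drained"
number, label, start, end, value = parse_extraction(
    "T1\tIngredients.Legumes 116 123\tlentils"
)
print(label.split(".", 1), end - start, matches_text(text, start, end, value))
print(encode_name("2024"), encode_name("Legumes"))
```

Checking text[start:end] against the value is a cheap way to detect annotation files whose offsets no longer match the document text.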
Section annotations
Section annotations are written like this:
T1 _SECTION 0 85 TITLE
The number after T is the sequential number of the annotation.
_SECTION is a constant string that marks a section annotation. 0 and 85 are, respectively, the zero-based positions in the text of the first character of the section and of the first character after the section. TITLE is the name of the section.
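Because section and extraction annotations both start with T, a reader of the file has to look at the second field to tell them apart. A minimal Python sketch, with a hypothetical function name, made-up document text, and fields split on any whitespace as an assumption about the separators:

```python
def parse_section(line):
    """Return (number, start, end, section name), or None for other lines."""
    fields = line.rstrip("\r\n").split(None, 4)
    if len(fields) != 5 or fields[1] != "_SECTION":
        return None
    marker, _, start, end, name = fields
    return int(marker[1:]), int(start), int(end), name


# Made-up document text: an 85-character title followed by the body.
text = "T" * 85 + "\nBody of the document"

section = parse_section("T1\t_SECTION 0 85\tTITLE")
number, start, end, name = section
print(name, len(text[start:end]))

# An extraction line is not a section annotation.
print(parse_section("T1\tIngredients.Legumes 116 123\tlentils"))
```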
[1] That is, not the original files that were uploaded, such as PDF or Microsoft Word files.