Document exchange archives

Archive structure

Platform's authoring application uses ZIP archives to exchange document texts and annotations. When you download documents, the application packs their texts¹ and any annotations into a ZIP file, called the exchange archive, that you can keep as a backup or upload to another project on the same or a different installation of the application.

The internal structure of an exchange archive is as follows:

test
ann

The test folder contains UTF-8 encoded plain text files, each containing the text of a document.

The ann folder is present only if at least one document has at least one annotation, and it contains the annotation files.
A document's text file and its annotation file are matched by file name, excluding the extension. For example:

test
    doc1.txt
    doc2.txt
    doc3.txt
    ...
ann
    doc2.ann
    ...

In this case doc2.ann is the annotation file for document doc2.txt. There are no annotation files for documents doc1.txt and doc3.txt.
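The name-based matching described above can be sketched in Python with the standard zipfile module. This is only an illustration: the function name pair_documents is hypothetical, and the sketch assumes that the test and ann folders sit at the root of the archive.

```python
import io
import zipfile
from pathlib import PurePosixPath


def pair_documents(archive):
    """Map each document name to (text member, annotation member or None)."""
    texts, anns = {}, {}
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            path = PurePosixPath(name)
            # Skip directory entries and anything not directly inside a folder.
            if name.endswith("/") or len(path.parts) != 2:
                continue
            folder, stem = path.parts[0], path.stem
            if folder == "test":
                texts[stem] = name
            elif folder == "ann":
                anns[stem] = name
    # A document without annotations gets None as its annotation member.
    return {stem: (member, anns.get(stem)) for stem, member in texts.items()}


# Build a small in-memory archive that mirrors the example above.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("test/doc1.txt", "First document.")
    zf.writestr("test/doc2.txt", "Second document.")
    zf.writestr("test/doc3.txt", "Third document.")
    zf.writestr("ann/doc2.ann", "C1\t15000000\n")
buffer.seek(0)

pairs = pair_documents(buffer)
```

Here pairs["doc2"] refers to both test/doc2.txt and ann/doc2.ann, while doc1 and doc3 have no annotation member.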

Note

You don't have to worry about the ZIP archive structure when exporting documents: the export procedure structures the archive automatically.

The authoring application can also import documents from ZIP files whose structure differs from that of exchange archives. In this case the archive is treated simply as a container of files, which can be in any of the formats supported by the upload procedure.

Annotation files

An annotation file is a UTF-8 encoded text file in brat standoff format.
It contains all the annotations of a document, which can be:

Expected categorization results, i.e. categories
Expected results of information extraction, i.e. class values
Sections of text such as titles

Category annotations

Category annotations are written like this:

C1  15000000
C2  20001117

The number after C is the sequential number of the annotation. The code after the tab is the category id.
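As a minimal sketch of how such lines could be read, the following Python function splits each line at the tab, as described above. The function name parse_categories is hypothetical, and lines that do not start with C are simply skipped.

```python
def parse_categories(ann_text):
    """Return (sequential number, category id) pairs from annotation text."""
    categories = []
    for line in ann_text.splitlines():
        if not line.startswith("C"):
            continue  # not a category annotation
        ann_id, _, category_id = line.partition("\t")
        number = ann_id[1:]
        category_id = category_id.strip()
        # Keep only well-formed lines: C<number><tab><category id>.
        if number.isdigit() and category_id:
            categories.append((int(number), category_id))
    return categories


sample = "C1\t15000000\nC2\t20001117\n"
parsed = parse_categories(sample)
```

With the two lines of the example, parsed contains the pairs (1, "15000000") and (2, "20001117").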

Extraction annotations

Extraction annotations, suitable for extraction and thesaurus projects, are written like this:

T1  Ingredients.Legumes 116 123 lentils

The number after T is the sequential number of the annotation.
Ingredients is the name of the group, Legumes is the name of the class.
116 is the zero-based position of the first character of the value in the text, 123 is the position of the first character after the value. lentils is the class value to extract.

For ungrouped classes, the name of the group must be replaced with the name of the class.

If the name of a group or of a class is composed exclusively of digits, it must be prefixed with the constant string X_.
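The offsets and the X_ prefix rule can be illustrated with a short Python sketch. The helper names safe_name and extraction_line are hypothetical. The sketch assumes that positions correspond to Python string indices and that fields are laid out as in the usual brat convention (a tab after the identifier and before the value, spaces between label and offsets). It covers grouped classes only.

```python
def safe_name(name):
    """Prefix a name made only of digits with X_, per the rule above."""
    return "X_" + name if name.isdigit() else name


def extraction_line(number, group, cls, text, value, search_from=0):
    """Build one extraction annotation line for the first match of value."""
    start = text.index(value, search_from)  # zero-based first character
    end = start + len(value)                # first character after the value
    label = f"{safe_name(group)}.{safe_name(cls)}"
    # Assumed layout: T<number><tab><label> <start> <end><tab><value>
    return f"T{number}\t{label} {start} {end}\t{value}"


text = "Soup with lentils and carrots."
line = extraction_line(1, "Ingredients", "Legumes", text, "lentils")
```

Because the end position is the first character after the value, the difference between the two offsets always equals the length of the value, so text[start:end] returns the value itself.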

Section annotations

Section annotations are written like this:

T1  _SECTION 0 85   TITLE

The number after T is the sequential number of the annotation.
_SECTION is a string constant that marks a section annotation. 0 and 85 are, respectively, the zero-based positions in the text of the first character of the section and of the first character after the section. TITLE is the name of the section.
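A section annotation can be read back and applied to the document text as in the following Python sketch. The function name parse_section is hypothetical, and the sketch assumes the fields are separated by whitespace and that positions correspond to Python string indices.

```python
def parse_section(line):
    """Return (sequential number, start, end, section name) for a section line."""
    ann_id, label, start, end, name = line.split(None, 4)
    if label != "_SECTION":
        raise ValueError("not a section annotation")
    return int(ann_id[1:]), int(start), int(end), name.strip()


# Hypothetical document whose first line is the title.
text = "Lentil soup\nSoup with lentils and carrots."
number, start, end, name = parse_section("T1\t_SECTION 0 11\tTITLE")
title = text[start:end]
```

Since the end position points to the first character after the section, slicing the text with the two positions yields exactly the section content.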

¹ That is, not the original files that were uploaded, such as PDF or Microsoft Word files.