
Prepare original documents

Need to prepare documents

When launching an analysis, Studio always analyzes files with the .txt extension contained in the test folder, possibly using corresponding annotation files (.ann) and layout files (.lay) contained in the ann folder.
These files can be created from within Studio or externally and then copied into the project's folders mentioned above.

However, Studio can create test files by preparing source files stored in the documents folder. The recognized formats are:

  1. Plain text files with the .txt extension.
  2. Files with the .json extension containing JSON with the structure of the output of the NL Flow TikaTesseract Converter processor.
  3. Files with the .json extension containing JSON with the structure of the output of the NL Flow Extract Converter processor.
  4. Files with the .json extension containing JSON with both the text and the sections properties corresponding to input variables of the same name of NL Flow model blocks.
  5. Legacy formats (CogX, TestBench and Cogito XML).

In all cases, during preparation, Studio extracts the plain text from the source file and stores it in a file with the .txt extension in the test folder.
At this stage, Studio executes the JavaScript onPrepare function, if defined, setting its text parameter with the text extracted from the file being prepared. Studio IDE takes the value returned by the function and writes it in the .txt file that it then stores in the test folder.
This makes it possible to pre-process the original text before it becomes a test file.
For example, if the original text is the result of OCR, the onPrepare function can perform a find-and-replace operation to fix misreadings such as the lowercase letter "l" being recognized as the digit "1".
In another example, onPrepare can be used to replace abbreviations and acronyms with words that facilitate linguistic analysis.
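
The two examples above can be sketched as follows. This is a minimal illustration, not Studio's reference implementation: it assumes onPrepare receives the extracted text as its only parameter and returns the text to be written, as described above. The ABBREVIATIONS table and the escapeRegExp helper are hypothetical names introduced only for this sketch, and the OCR rule (a digit "1" between two lowercase letters is really an "l") is a deliberately simple heuristic.

```javascript
// Hypothetical replacement table: adapt it to your own documents.
var ABBREVIATIONS = {
  "approx.": "approximately",
  "dept.": "department"
};

// Hypothetical helper: escapes characters that have a special meaning in a RegExp.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function onPrepare(text) {
  // OCR fix (assumed heuristic): a "1" between two lowercase letters becomes "l".
  // The lookahead keeps the following letter available for the next match.
  var result = text.replace(/([a-z])1(?=[a-z])/g, "$1l");

  // Expand each abbreviation when it starts at a word boundary.
  Object.keys(ABBREVIATIONS).forEach(function (abbr) {
    var pattern = new RegExp("\\b" + escapeRegExp(abbr), "g");
    result = result.replace(pattern, function () {
      return ABBREVIATIONS[abbr];
    });
  });

  // The returned value is what gets written to the .txt file in the test folder.
  return result;
}
```

With this sketch, onPrepare("The fi1e is approx. 3 pages") returns "The file is approximately 3 pages", while standalone numbers such as "101" are left untouched.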

The onPrepare function is executed only during preparation, not during analysis. This is intentional: interpreting analysis results against the test files is easier when those files are not modified during the analysis itself.
However, the function is executed on the input text by the text intelligence engine produced by deploying the project.
Document preparation therefore also simulates the operation that the engine will perform and lets you see its results in the files created in the test folder, ready for analysis.

In case 3, Studio also produces a file with the .lay extension that is stored in the ann folder. This file allows assigning these attributes to the text tokens during analysis:

which can then be used in rule conditions.

In case 4, Studio also produces a file with the .ann extension that is stored in the ann folder. The file contains section annotations. In case 5, Studio can behave as in case 3 if the source file contains section information.

Note: since importing a library produces files in the documents folder, these will need to be prepared to become test files.

Procedure

Document preparation commands in the Studio IDE work on all or part of the contents of the project's documents folder and, as mentioned above, output prepared documents in the project's test folder.
If an entire folder is prepared, its structure gets replicated inside the test folder.

Place or create the original documents in the documents folder, or import them. You can organize the files in sub-folders.

  1. In the Project window, select the files and/or folders whose documents you want to prepare. If you select the documents folder, all its contents will be processed.
  2. After that, right-click any of the selected items and select Prepare Selection.

Or, for a single document:

  1. Open a document in the editor.
  2. Right-click the editing area and select Prepare Document.

Warning

If the test folder already contains files and/or folders with the same name and location as the items you prepare, the items in the test folder are overwritten. If you want to keep them, make a backup copy before preparing the documents.
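
One possible way to make such a backup is a small script run outside Studio. The sketch below is an assumption-laden illustration rather than a Studio feature: it assumes Node.js 16.7 or later (for fs.cpSync) and a project directory that contains the test folder. The function name backupTestFolder and the "test.backup-" naming scheme are hypothetical.

```javascript
const fs = require("fs");
const path = require("path");

// Copies <projectDir>/test to <projectDir>/test.backup-<timestamp>
// and returns the path of the copy. Hypothetical helper, not part of Studio.
function backupTestFolder(projectDir) {
  const source = path.join(projectDir, "test");
  // Timestamp made file-system friendly, e.g. 2024-01-31T10-15-00-000Z
  const stamp = new Date().toISOString().replace(/[:.]/g, "-");
  const target = path.join(projectDir, "test.backup-" + stamp);

  // Recursive copy keeps the sub-folder structure of the test folder.
  fs.cpSync(source, target, { recursive: true });
  return target;
}

// Usage (path is a placeholder):
// backupTestFolder("/path/to/project");
```

The copy is written next to the test folder, so it is not touched by later preparation runs.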

A notification of the operation will be displayed in the Output panel of the Console tool window. If two or more documents were prepared, a report will be produced too and it will be accessible through the Report tool window.