Prepare original documents
Need to prepare documents
When launching an analysis, Studio always analyzes files with the .txt extension contained in the test folder, possibly using corresponding annotation files (.ann) and layout files (.lay) contained in the ann folder.
These files can be created from within Studio or externally and then copied into the project's folders mentioned above.
However, Studio can also create test files by preparing source files stored in the documents folder. The recognized formats are:
1. Plain text files with the .txt extension.
2. Files with the .json extension containing JSON with the structure of the output of the NL Flow TikaTesseract Converter processor.
3. Files with the .json extension containing JSON with the structure of the output of the NL Flow Extract Converter processor.
4. Files with the .json extension containing JSON with both the text and the sections properties, corresponding to the input variables of the same name of NL Flow model blocks.
5. Legacy formats (CogX, TestBench and Cogito XML).
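As an illustration of the format that requires both the text and the sections properties, the following sketch checks whether a source file has that shape. It is not how Studio detects formats: the function name looksLikeTextAndSectionsJson is hypothetical, and the check assumes only that text is a string and that a sections property is present, without assuming anything about the internal structure of sections.

```javascript
// Hypothetical helper, not part of Studio: returns true when a file name ends
// in .json and its content parses to an object with both "text" and "sections".
function looksLikeTextAndSectionsJson(fileName, content) {
    if (!/\.json$/i.test(fileName)) {
        return false;
    }
    var data;
    try {
        data = JSON.parse(content);
    } catch (e) {
        return false; // not valid JSON
    }
    return data !== null &&
        typeof data === "object" &&
        !Array.isArray(data) &&
        typeof data.text === "string" &&
        "sections" in data;
}
```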
In all cases, during preparation, Studio extracts the plain text from the source file and stores it in a file with the .txt extension in the test folder.
At this stage, Studio executes the JavaScript onPrepare function, if defined, setting its text parameter to the text extracted from the file being prepared. Studio IDE takes the value returned by the function and writes it to the .txt file that it then stores in the test folder.
This makes it possible to change the original text, which is useful for pre-processing it.
For example, if the original text is the result of OCR, the onPrepare function can perform a find-and-replace operation to fix misreadings such as the lowercase letter "l" appearing in place of the digit "1".
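A minimal sketch of such an OCR fix follows. It assumes that onPrepare takes the extracted text as its single text parameter and returns the text to be written to the .txt file, as described above. The replacement rule itself is only an illustrative heuristic: inside tokens made solely of digits and "l" characters that contain at least one digit, every "l" becomes "1".

```javascript
// Sketch under assumptions: one "text" parameter, returned string is stored.
function onPrepare(text) {
    // Match whole tokens composed only of digits and lowercase "l".
    return text.replace(/\b[0-9l]+\b/g, function (token) {
        // Only touch tokens that contain at least one real digit,
        // so that ordinary words are left alone.
        return /[0-9]/.test(token) ? token.replace(/l/g, "1") : token;
    });
}
```

Because the rule only matches whole tokens, words such as "level" or "total" are not affected, while a token like "20l9" is rewritten as "2019".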
As another example, onPrepare can be used to replace abbreviations and acronyms with words that facilitate linguistic analysis.
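The abbreviation case can be sketched in the same way. The EXPANSIONS table and the escapeRegExp helper below are hypothetical and exist only for illustration; the signature of onPrepare is again assumed to be a single text parameter with the prepared text as return value.

```javascript
// Hypothetical lookup table: abbreviation or acronym -> expanded form.
var EXPANSIONS = {
    "approx.": "approximately",
    "Dept.": "Department",
    "ASAP": "as soon as possible"
};

// Escape characters that have a special meaning in regular expressions.
function escapeRegExp(s) {
    return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function onPrepare(text) {
    var result = text;
    Object.keys(EXPANSIONS).forEach(function (abbr) {
        // Match the abbreviation only when it is not embedded in a longer word.
        var pattern = new RegExp(
            "(^|[^A-Za-z0-9])" + escapeRegExp(abbr) + "(?![A-Za-z0-9])", "g");
        result = result.replace(pattern, function (match, before) {
            return before + EXPANSIONS[abbr];
        });
    });
    return result;
}
```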
The onPrepare function is not executed during analysis, only during preparation. This is intentional: it makes it easier to interpret analysis results against the test files, which requires that they not be modified during the analysis itself.
However, the text intelligence engine produced by deploying the project does execute the function on its input text.
Document preparation, therefore, also serves to simulate this operation that the engine will perform, with the results visible in the files created in the test folder, ready for analysis.
In case 3, Studio also produces a file with the .lay extension that is stored in the ann folder. This file allows attributes to be assigned to the text tokens during analysis, which can then be used in rule conditions.
In case 4, Studio also produces a file with the .ann extension that is stored in the ann folder. The file contains section annotations.
In case 5, Studio can behave as in case 3 if the source file contains section information.
Note: since importing a library produces files in the documents folder, these files need to be prepared to become test files.
Procedure
Document preparation commands in the Studio IDE work on all or part of the contents of the project's documents folder and, as mentioned above, output prepared documents in the project's test folder.
If an entire folder is prepared, its structure is replicated inside the test folder.
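The mapping from the documents folder to the test folder described above can be sketched as a small Node.js function. This is an illustration, not Studio code: the name preparedPath is hypothetical, and the sketch assumes that the prepared file keeps the base name of the source file and only replaces its extension with .txt.

```javascript
// path.posix keeps forward slashes so the example is stable across platforms;
// use require("path") directly for platform-native separators.
var path = require("path").posix;

// Hypothetical helper: given the project folder and a source file under
// <project>/documents, return the assumed location of the prepared file
// under <project>/test, replicating the sub-folder structure.
function preparedPath(projectDir, sourceFile) {
    var documentsDir = path.join(projectDir, "documents");
    var relative = path.relative(documentsDir, sourceFile);
    if (relative === "" || relative.split("/")[0] === ".." || path.isAbsolute(relative)) {
        throw new Error("Source file is not inside the documents folder");
    }
    var parsed = path.parse(relative);
    // Assumption: same base name, extension replaced by .txt.
    return path.join(projectDir, "test", parsed.dir, parsed.name + ".txt");
}
```

For instance, under these assumptions a source file in documents/contracts/2023 would map to a .txt file in test/contracts/2023.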
- Put or create original documents in the documents folder, possibly importing them. You can organize the files in sub-folders.
- In the Project window, select the files and/or folders whose documents you want to prepare. If you select the documents folder, all its contents will be processed.
- Right-click any of the selected items and select Prepare Selection.
Or, for a single document:
- Open a document in the editor.
- Right-click the editing area and select Prepare Document.
Warning
If the test folder already contains files and/or folders with the same name and location as the items you are preparing, the items in the test folder are overwritten. If you want to keep them, make a backup copy before preparing the documents.
A notification of the operation is displayed in the Output panel of the Console tool window. If two or more documents were prepared, a report is also produced and can be accessed through the Report tool window.
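The backup recommended in the warning above can be made by hand or, as a sketch, with a small Node.js script run outside Studio. The function name backupTestFolder and the choice of backup location are hypothetical; the script relies only on the standard fs and path modules (fs.cpSync requires a recent Node.js version).

```javascript
var fs = require("fs");
var path = require("path");

// Hypothetical helper: copy <project>/test to backupDir before preparing
// documents. Returns the backup location, or null if there is nothing to copy.
function backupTestFolder(projectDir, backupDir) {
    var source = path.join(projectDir, "test");
    if (!fs.existsSync(source)) {
        return null; // no test folder yet, so nothing can be overwritten
    }
    if (fs.existsSync(backupDir)) {
        // Refuse to overwrite an earlier backup.
        throw new Error("Backup folder already exists: " + backupDir);
    }
    fs.cpSync(source, backupDir, { recursive: true });
    return backupDir;
}
```

Refusing to reuse an existing backup folder is a deliberate choice in this sketch, so that an older backup is never silently replaced.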