Prepare original documents
It is sometimes necessary to pre-process input documents before analyzing them.
For example, if the input document is the result of OCR, a find-and-replace operation can fix misreadings such as the lowercase letter "l" being confused with the digit "1". Likewise, when dealing with social media messages, it may be useful to replace abbreviations and acronyms with the full words they stand for, which makes linguistic analysis easier.
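The two cleanup operations mentioned above could be sketched as follows in an external pre-processing script. This is only an illustration: the function names, the regular expressions and the abbreviation table are assumptions made for the example, not part of Studio, and a real project would tune the rules to its own documents.

```python
import re

# Hypothetical abbreviation table; a real project would maintain its own.
ABBREVIATIONS = {
    "imo": "in my opinion",
    "btw": "by the way",
    "thx": "thanks",
}

# Heuristic (an assumption of this sketch): a digit "1" with a letter on
# both sides is most likely a misread lowercase "l".
_ONE_BETWEEN_LETTERS = re.compile(r"(?<=[A-Za-z])1(?=[A-Za-z])")

# All abbreviations in one alternation, matched as whole words, ignoring case.
_ABBREVIATION = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b",
    re.IGNORECASE,
)


def fix_ocr_digit_one(text: str) -> str:
    """Replace a digit 1 that sits between two letters with a lowercase l."""
    return _ONE_BETWEEN_LETTERS.sub("l", text)


def expand_abbreviations(text: str) -> str:
    """Replace each known abbreviation with its expansion."""
    return _ABBREVIATION.sub(
        lambda match: ABBREVIATIONS[match.group(1).lower()], text
    )


def preprocess(text: str) -> str:
    """Apply the OCR fix first, then expand abbreviations."""
    return expand_abbreviations(fix_ocr_digit_one(text))


if __name__ == "__main__":
    print(preprocess("imo the fi1e is fine, 2019 edition"))
```

The rule for "1" is deliberately narrow: digits inside numbers such as "2019" are left alone because they are not surrounded by letters.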
Document pre-processing can be performed by an external process that runs before the text intelligence engine, or by the text intelligence engine itself using the onPrepare scripting function.
Every text intelligence engine you produce and deploy with Studio invokes the onPrepare function each time a document is submitted to it, before text analysis. That function is therefore the right place for script code that manipulates the text to improve the subsequent analysis.
Interactive analysis commands like Analyze, however, do not trigger the onPrepare function: they act on the files inside the test folder and treat them as already prepared. To simulate pre-processing, use the document preparation procedure described below.
- Put or create original documents in the documents folder. You can organize the files in sub-folders.
- In the Project window, select the files and/or folders you want to pre-process. If you select the documents folder, all its contents will be pre-processed.
- Right-click any of the selected items and select Prepare Selection. If your selection includes sub-folders of the documents folder, they are re-created in the test folder. If the test folder already contains files and/or folders with the same name and location as items you have prepared, the items in the test folder are overwritten. If you want to keep them, make a backup copy first.
Or, for a single document:
- Open a document in the editor.
- Right-click the editing area and select Prepare Document.
The outcome of the operation will be displayed in the Output panel of the Console tool window.
If two or more documents were prepared, a report is also produced, and it is accessible through the Report tool window.
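To make the folder behavior described above more concrete (sub-folders re-created under the test folder, same-named files overwritten, backups left to the user), here is a minimal Python sketch that imitates it outside Studio. It is not how Studio implements preparation: `prepare_folder`, its parameters and the optional backup folder are hypothetical names chosen for this example, and the `transform` argument stands in for whatever pre-processing the project applies.

```python
import shutil
from pathlib import Path
from typing import Callable, List, Optional


def prepare_folder(
    documents: Path,
    test: Path,
    transform: Callable[[str], str],
    backup: Optional[Path] = None,
) -> List[Path]:
    """Write a transformed copy of every file under `documents` into `test`.

    Sub-folders are re-created under `test`. Existing files with the same
    relative path are overwritten; if `backup` is given, the old version
    is copied there first.
    """
    written = []
    for source in sorted(documents.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(documents)
        target = test / relative
        target.parent.mkdir(parents=True, exist_ok=True)

        # Keep the previous prepared file, since it is about to be overwritten.
        if backup is not None and target.exists():
            saved = backup / relative
            saved.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(target, saved)

        text = source.read_text(encoding="utf-8")
        target.write_text(transform(text), encoding="utf-8")
        written.append(target)
    return written


if __name__ == "__main__":
    # Example: upper-case every document as a stand-in transformation.
    for path in prepare_folder(Path("documents"), Path("test"), str.upper):
        print("prepared", path)
```

The sketch assumes UTF-8 text files; other encodings or binary formats would need their own handling.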
You can also prepare different types of JSON files:
- JSON files with the output structure of the NL Flow Extract Converter processor, whose preparation will provide the:
- Input JSON files for NL Flow models in which sections have been defined, whose preparation will provide the:
  - .txt file in the test folder with the textual content of the JSON.
  - .ann file containing the annotations of the document sections.
- JSON files with the output structure of the NL Flow TikaTesseract Converter processor, whose preparation will provide the .txt file in the test folder with the textual content of the JSON.