Prepare original documents
It is sometimes necessary to pre-process input documents before analyzing them.
For example, if the input document is the result of OCR, a find-and-replace operation can fix misreadings such as the lowercase letter "l" rendered as the digit "1". Likewise, when dealing with social media messages, it may be useful to replace abbreviations and acronyms with the full words, which facilitates linguistic analysis.
Document pre-processing can be performed by an external process that runs before the text intelligence engine, or by the text intelligence engine itself using the onPrepare scripting function.
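As an illustration of the external-process option, here is a minimal Python sketch of the two clean-ups mentioned above. It is not part of Studio: the function names, the abbreviation table and the "1 next to a letter" heuristic are all assumptions made for the example, and would need tuning on real documents.

```python
import re

# Hypothetical abbreviation table; replace it with entries from your own corpus.
ABBREVIATIONS = {
    "brb": "be right back",
    "imo": "in my opinion",
    "thx": "thanks",
}

# Heuristic (an assumption, not a Studio rule): a run of "1" digits that
# touches a letter is treated as a misread run of lowercase "l" letters.
_ONES_NEXT_TO_LETTER = re.compile(r"(?<=[A-Za-z])1+|1+(?=[A-Za-z])")

# Whole-word, case-insensitive match for any known abbreviation.
_ABBREVIATION = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in ABBREVIATIONS) + r")\b",
    re.IGNORECASE,
)


def fix_ocr_digit_one(text: str) -> str:
    """Replace each "1" in a letter-adjacent run with a lowercase "l"."""
    return _ONES_NEXT_TO_LETTER.sub(lambda m: "l" * len(m.group()), text)


def expand_abbreviations(text: str) -> str:
    """Replace known abbreviations with their expanded form."""
    return _ABBREVIATION.sub(
        lambda m: ABBREVIATIONS[m.group(1).lower()], text
    )


def prepare_text(text: str) -> str:
    """Apply the OCR fix first, then the abbreviation expansion."""
    return expand_abbreviations(fix_ocr_digit_one(text))


if __name__ == "__main__":
    print(prepare_text("thx for the fina1 bi11 of 120 euros"))
```

Standalone numbers such as "120" are left alone, but the heuristic also rewrites legitimate tokens that mix letters and the digit 1 (for example "H1N1" or "1st"), so a real pre-processor would likely need a dictionary check or an exclusion list.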
Every text intelligence engine you produce and deploy with Studio invokes the onPrepare function each time a document is submitted to it, before text analysis. That function is therefore the right place for script code that manipulates the text to improve the subsequent analysis.
Interactive analysis commands like Analyze, however, do not trigger the onPrepare function: they act on the files inside the test folder and treat them as already prepared. So, in order to simulate pre-processing, use the document preparation procedure described below.
- Put or create original documents in the documents folder.
  Tip: You can organize the files in sub-folders.
- In the Project window, select the files and/or folders you want to pre-process. If you select the documents folder, all its contents will be pre-processed.
- Right-click any of the selected items and select Prepare Selection. If your selection includes sub-folders of the documents folder, they are re-created in the test folder.
  Warning: If the test folder already contains files and/or folders with the same name and location as items you have prepared, the items in the test folder are overwritten. If you want to keep them, make a backup copy first.
Or, for a single document:
- Open a document in the editor.
- Right-click the editing area and select Prepare Document.
The outcome of the operation will be displayed in the Output panel of the Console tool window.
If two or more documents were prepared, a report is also produced and is accessible through the Report tool window.
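Because preparation overwrites items with the same name and location in the test folder, it can help to take the backup copy with a small script before preparing. The following Python sketch shows one possible way to do it outside Studio; the function name and the "test-backup-<timestamp>" naming are assumptions for the example, and the only thing taken from the procedure above is that the project contains a folder named test.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional


def backup_test_folder(project_dir: Path) -> Optional[Path]:
    """Copy <project_dir>/test to a timestamped sibling folder.

    Returns the path of the backup, or None when there is no test
    folder to back up. The backup location and name are hypothetical
    choices, not a Studio convention.
    """
    test_dir = project_dir / "test"
    if not test_dir.is_dir():
        return None

    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    backup_dir = project_dir / f"test-backup-{stamp}"

    # copytree re-creates the whole sub-folder structure and fails with
    # FileExistsError if the destination already exists, so an earlier
    # backup is never silently overwritten.
    shutil.copytree(test_dir, backup_dir)
    return backup_dir
```

Calling `backup_test_folder(Path("path/to/project"))` before Prepare Selection leaves a copy of the current test folder that can be restored by hand if the prepared files turn out to be unwanted. You may prefer a backup location outside the project folder so that the copy does not appear among the project files.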
You can also prepare different types of JSON files:
- JSON files with the output structure of the NL Flow Extract Converter processor, whose preparation will provide:
  - A .txt file in the test folder with the textual content of the JSON.
  - A .lay file containing the layout annotations, which allow you to write rules that use layout attributes.
- Input JSON files for NL Flow models in which sections have been defined, whose preparation will provide:
  - A .txt file in the test folder with the textual content of the JSON.
  - A .ann file containing the annotations of the document sections.
- JSON files with the output structure of the NL Flow TikaTesseract Converter processor, whose preparation will provide a .txt file in the test folder with the textual content of the JSON.