
Prepare original documents

It is sometimes necessary to pre-process input documents before analyzing them.

For example, if the input document is the result of OCR, a find-and-replace operation can fix misreadings such as the lowercase letter "l" being confused with the digit "1". Similarly, when dealing with social media messages, it may be useful to replace abbreviations and acronyms with full words that facilitate linguistic analysis.
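As an illustration of both cases, here is a minimal Python sketch of the kind of clean-up an external pre-processing step might perform. It is not part of Studio: the function names, the regular expressions, and the abbreviation table are all hypothetical, and a real project would tune them to its own documents.

```python
import re

# Hypothetical abbreviation table; a real project would maintain its own.
ABBREVIATIONS = {
    "btw": "by the way",
    "imo": "in my opinion",
    "thx": "thanks",
}

# One pattern matching any listed abbreviation as a whole word, ignoring case.
_ABBREVIATION_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b",
    re.IGNORECASE,
)


def fix_ocr_confusions(text: str) -> str:
    """Repair two simple confusions between the letter 'l' and the digit '1'."""
    # A digit 1 with letters on both sides is assumed to be a misread 'l'.
    text = re.sub(r"(?<=[A-Za-z])1(?=[A-Za-z])", "l", text)
    # A letter l with digits on both sides is assumed to be a misread '1'.
    text = re.sub(r"(?<=\d)l(?=\d)", "1", text)
    return text


def expand_abbreviations(text: str) -> str:
    """Replace each known abbreviation with its expanded form."""
    return _ABBREVIATION_PATTERN.sub(
        lambda match: ABBREVIATIONS[match.group(1).lower()], text
    )


def preprocess(text: str) -> str:
    """Apply the OCR fixes first, then expand abbreviations."""
    return expand_abbreviations(fix_ocr_confusions(text))


print(preprocess("He1lo, the total is 2l5. thx"))
```

The heuristics are deliberately conservative: a "1" or "l" is changed only when the characters on both sides make the intended reading unambiguous, so isolated digits and ordinary words are left alone.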

Document pre-processing can be performed either by an external process that runs before the text intelligence engine or by the engine itself, using the onPrepare scripting function. Every text intelligence engine you produce and deploy with Studio invokes the onPrepare function each time a document is submitted to it, before text analysis. That function is therefore the right place for script code that manipulates the text to improve the subsequent analysis.
Interactive analysis commands like Analyze, however, do not trigger the onPrepare function: they act on the files inside the test folder and treat them as already prepared. To simulate pre-processing, use the document preparation procedure described below.

  1. Put or create original documents in the documents folder.

    Tip

    You can organize the files in sub-folders.

  2. In the Project window, select the files and/or folders you want to pre-process. If you select the documents folder, all its contents will be pre-processed.

  3. Right-click any of the selected items and select Prepare Selection. If your selection includes sub-folders of the documents folder, they are re-created in the test folder.

    Warning

    If the test folder already contains files and/or folders with the same name and location as items you have prepared, the items in the test folder are overwritten. If you want to keep them, make a backup copy first.

Or, for a single document:

  1. Open a document in the editor.
  2. Right-click the editing area and select Prepare Document.

The outcome of the operation will be displayed in the Output panel of the Console tool window.
If two or more documents were prepared, a report is also produced and can be accessed through the Report tool window.
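For readers who pre-process with an external process instead, the folder behavior described in the procedure above (sub-folders re-created in the test folder, same-named files overwritten) can be approximated with a short Python sketch. This is only an assumption about how such a script might look, not how Studio implements Prepare Selection: the prepare_tree function, its parameters, and the optional backup folder are hypothetical, and the files are assumed to be UTF-8 text.

```python
from pathlib import Path
import shutil
import tempfile


def prepare_tree(documents_dir, test_dir, transform, backup_dir=None):
    """Write a transformed copy of every file under documents_dir into test_dir.

    Sub-folders are re-created under test_dir. A file that already exists in
    test_dir is overwritten; if backup_dir is given, the old file is copied
    there first (hypothetical safeguard, mirroring the backup advice).
    """
    documents_dir = Path(documents_dir)
    test_dir = Path(test_dir)
    backup_dir = Path(backup_dir) if backup_dir is not None else None

    written = []
    for source in sorted(documents_dir.rglob("*")):
        if not source.is_file():
            continue  # folders are created on demand below
        relative = source.relative_to(documents_dir)
        target = test_dir / relative
        target.parent.mkdir(parents=True, exist_ok=True)

        if target.exists() and backup_dir is not None:
            backup = backup_dir / relative
            backup.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(target, backup)

        # Assumes UTF-8 text; other encodings would need explicit handling.
        prepared = transform(source.read_text(encoding="utf-8"))
        target.write_text(prepared, encoding="utf-8")
        written.append(target)
    return written


if __name__ == "__main__":
    # Small self-contained demonstration in a temporary folder.
    with tempfile.TemporaryDirectory() as tmp:
        docs = Path(tmp) / "documents"
        (docs / "tweets").mkdir(parents=True)
        (docs / "tweets" / "one.txt").write_text("  hello  ", encoding="utf-8")
        for path in prepare_tree(docs, Path(tmp) / "test", str.strip):
            print(path.relative_to(tmp))
```

Passing the text transformation as a function keeps the folder handling separate from the clean-up rules, so the same walk can be reused with whatever pre-processing a project needs.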

You can also prepare different types of JSON files:

  • JSON files with the output structure of the NL Flow Extract Converter processor, whose preparation will provide the:

    • .txt file in the test folder with the textual content of the JSON.
    • .lay file containing the layout annotations allowing you to write rules using attributes like:

  • Input JSON files for NL Flow models in which sections have been defined, whose preparation will provide the:

    • .txt file in the test folder with the textual content of the JSON.
    • .ann file containing the annotations of the document sections.

  • JSON files with the output structure of the NL Flow TikaTesseract Converter processor, whose preparation will provide the .txt file in the test folder with the textual content of the JSON.