TikaTesseract Converter processor

Description

The TikaTesseract Converter processor tries to extract plain text from a file.

It's a combination of Apache Tika and Tesseract OCR.

If the file format is recognized as one of those supported by Tika and the OCR strategy (see below) is Smart or Never, the file is processed by Tika.

If the file format is one of these:

PDF
PNG
JPEG
TIFF
GIF
WebP
BMP
PNM

and the OCR strategy is Always, the file is instead processed by Tesseract OCR.

If the file was processed by Tika, the OCR strategy is Smart, and Tika extracted no text or less text than the Smart OCR threshold (see below), the file is then submitted to Tesseract OCR, provided it is in one of the formats listed above.
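The routing rules above can be sketched as two small decision functions. This is only an illustration of the described behavior, not the processor's actual code: the names OcrStrategy, first_engine and needs_ocr_fallback are hypothetical, and combinations the description does not cover (for example, the Always strategy with a format outside the list) are reported as None rather than guessed.

```python
from enum import Enum

# Formats the description lists as eligible for Tesseract OCR.
OCR_FORMATS = {"PDF", "PNG", "JPEG", "TIFF", "GIF", "WEBP", "BMP", "PNM"}


class OcrStrategy(Enum):
    SMART = "SMART"
    ALWAYS = "ALWAYS"
    NEVER = "NEVER"


def first_engine(file_format, strategy, tika_supported):
    """Return "tika", "tesseract", or None when the description does not cover the case."""
    if strategy is OcrStrategy.ALWAYS:
        # Always: OCR-eligible formats go straight to Tesseract.
        return "tesseract" if file_format.upper() in OCR_FORMATS else None
    # Smart or Never: Tika handles every format it recognizes.
    return "tika" if tika_supported else None


def needs_ocr_fallback(file_format, strategy, extracted_chars, smart_ocr_threshold):
    """Decide, after a Tika pass, whether the file should also be sent to Tesseract."""
    if strategy is not OcrStrategy.SMART:
        return False
    if file_format.upper() not in OCR_FORMATS:
        return False
    # Fallback when Tika found no text, or less text than the threshold.
    return extracted_chars == 0 or extracted_chars < smart_ocr_threshold
```

For instance, a scanned PDF for which Tika returns 12 characters against a threshold of 100 would be routed to Tika first and then flagged for the OCR fallback.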

Input

The processor requires the input JSON to contain this top level key:

"base64": "base64Encoding"

where base64Encoding is the Base64 encoding of the file.
Optional keys are:

filePath (string): the name or path of the file. It is used only for logging and auditing purposes; the file is never read from this path, because its content is carried entirely by the value of the base64 key.
enableOcr (string): overrides the value of the functional parameter Enable OCR strategy (see below). Allowed values are SMART, ALWAYS and NEVER.
smartOcrThreshold (number): overrides the value of the functional parameter Smart OCR threshold (see below).
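As a minimal sketch of how such an input could be assembled, the following Python builds the JSON object from raw file bytes. The helper name build_input and the sample values are illustrative assumptions; only the key names and the allowed enableOcr values come from the description above. How the JSON is then submitted to the workflow is not covered here.

```python
import base64
import json

ALLOWED_OCR_VALUES = ("SMART", "ALWAYS", "NEVER")


def build_input(file_bytes, file_path=None, enable_ocr=None, smart_ocr_threshold=None):
    """Build the processor input; only the base64 key is required."""
    payload = {"base64": base64.b64encode(file_bytes).decode("ascii")}
    if file_path is not None:
        payload["filePath"] = file_path  # logging/auditing only
    if enable_ocr is not None:
        if enable_ocr not in ALLOWED_OCR_VALUES:
            raise ValueError(f"enableOcr must be one of {ALLOWED_OCR_VALUES}")
        payload["enableOcr"] = enable_ocr
    if smart_ocr_threshold is not None:
        payload["smartOcrThreshold"] = smart_ocr_threshold
    return payload


# Hypothetical sample: a tiny in-memory "file" instead of a real document.
example = build_input(b"hello", file_path="notes.txt", enable_ocr="SMART", smart_ocr_threshold=100)
print(json.dumps(example, indent=2))
```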

Block properties

Block properties can be set by editing the block.
TikaTesseract Converter workflow blocks have the following properties:

Common:
- The unique block ID and the service version, displayed in the title bar (read only, displayed also in the block tooltip in the canvas).
- Block name: the name of the block (editable).
- Description: the description of the processor (read only).
Type Specific:
- Timeout: execution timeout expressed in minutes (m) or seconds (s).
Deployment:
- Replicas: number of required instances.
- Memory: required memory.
- CPU: thousandths of a CPU required (for example: 1000 = 1 CPU).
Functional:
- Enable OCR strategy: set the OCR strategy explained above.
  Possible values are:
  - Smart: the processor first tries to extract characters without resorting to OCR. If the size of the extracted text, in characters, is lower than the value of the Smart OCR threshold parameter, extraction is tried again with OCR.
  - Always: text extraction with OCR is always tried.
  - Never: OCR is never used.
- Content characters threshold: maximum number of extracted characters to be returned.
- Smart OCR threshold: when the OCR strategy is Smart (see above), if Tika extracts fewer characters than the value of this parameter, extraction with OCR is tried.
Input:

Used for input mapping: one property for each of the top level keys of the input JSON.
These properties do not need to be set if either of the following is true:
- The block is the first in a flow and the workflow input contains only the expected keys.
- The previous block's output contains only the expected keys.
Otherwise, the properties determine which top-level keys of the overall "upstream JSON" are mapped to the block's input keys. The value of each property must be chosen from the compatible keys of the upstream blocks' output or, if the input format of the workflow has been defined, from the keys of the $nlflow_input pseudo block.
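Conceptually, the input mapping selects upstream keys and renames them to the keys the block expects. The sketch below only illustrates that idea; it is not how the platform implements mapping, and the function name map_input and the upstream key names (fileContent, fileName, language) are hypothetical.

```python
def map_input(upstream, mapping):
    """Build the block input by picking upstream keys according to the mapping.

    mapping: block input key -> upstream top-level key.
    """
    return {block_key: upstream[upstream_key] for block_key, upstream_key in mapping.items()}


# Hypothetical upstream JSON containing more than the expected keys.
upstream_json = {"fileContent": "aGVsbG8=", "fileName": "notes.txt", "language": "en"}

# Hypothetical property values: which upstream key feeds each input key.
input_mapping = {"base64": "fileContent", "filePath": "fileName"}

block_input = map_input(upstream_json, input_mapping)
```

Here block_input contains only the base64 and filePath keys, and the unrelated language key is left out.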

Output and output-input mapping

The output of a TikaTesseract Converter block is a JSON object with the following structure:

{
    "content": converted text,
    "mime": media type,
    "path": file name or path    
}

Typically, the content key is mapped to the text input property of model blocks.
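A small Python sketch of consuming this output outside the workflow editor follows. The sample values and the helper name to_model_input are assumptions made for illustration; only the three output keys and the text target key come from the description above.

```python
import json

# Hypothetical output of a TikaTesseract Converter block.
raw_output = '{"content": "Invoice 42", "mime": "application/pdf", "path": "invoice.pdf"}'


def to_model_input(converter_output):
    """Map the converter's content key to the text key expected by a model block."""
    return {"text": converter_output["content"]}


output = json.loads(raw_output)
model_input = to_model_input(output)
```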