TikaTesseract Converter processor

Description

The TikaTesseract Converter processor tries to extract plain text from a file.

It combines Apache Tika and Tesseract OCR.

If the file format is recognized as one of those supported by Tika and the OCR strategy (see below) is Smart or Never, the file is processed by Tika.

If the file format is one of these:

PDF
PNG
JPEG
TIFF
GIF
WebP
BMP
PNM

and the OCR strategy is Always, the file is instead processed by Tesseract OCR.

If the file has been processed by Tika, the OCR strategy is Smart, and Tika has extracted no text or less text than the Smart OCR threshold (see below), the file is then submitted to Tesseract OCR, provided it is in one of the formats listed above.
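
The routing rules above can be summarized in a short Python sketch. It is only an illustration of the described behavior, not the processor's actual code: the function name plan_extraction, the engine labels and the string representation of formats are hypothetical. It assumes the file format is one Tika supports, and it returns an empty list for the Always strategy with a format outside the list above, because that case is not covered by this description.

```python
# Illustrative sketch of the routing rules described above.
# All names here are hypothetical; this is not the processor's real code.

# Formats that can be submitted to Tesseract OCR (from the list above).
OCR_FORMATS = {"PDF", "PNG", "JPEG", "TIFF", "GIF", "WEBP", "BMP", "PNM"}


def plan_extraction(file_format, strategy, tika_text_length, smart_ocr_threshold):
    """Return the ordered list of engines that would handle the file.

    file_format: format name such as "PDF" (assumed to be supported by Tika)
    strategy: "SMART", "ALWAYS" or "NEVER"
    tika_text_length: number of characters Tika extracted (unused for ALWAYS)
    smart_ocr_threshold: value of the Smart OCR threshold
    """
    ocr_capable = file_format.upper() in OCR_FORMATS

    if strategy == "ALWAYS":
        # Only the OCR-capable formats are documented for this strategy;
        # an empty list marks the case the description does not cover.
        return ["tesseract"] if ocr_capable else []

    if strategy == "NEVER":
        return ["tika"]

    if strategy == "SMART":
        # Tika runs first; OCR is tried only if Tika found no text
        # or less text than the threshold, and the format allows OCR.
        too_little = tika_text_length == 0 or tika_text_length < smart_ocr_threshold
        if ocr_capable and too_little:
            return ["tika", "tesseract"]
        return ["tika"]

    raise ValueError(f"Unknown OCR strategy: {strategy!r}")
```

For example, under these assumptions a scanned PDF from which Tika extracts 10 characters, with a threshold of 100 and the Smart strategy, would go through Tika first and then Tesseract OCR.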

Versions

Two versions of the component are available: 1.0.0 and 1.1.0. Version 1.0.0 is kept for backward compatibility with workflows created with previous versions of NL Flow. For new workflows, always use the latest version.

Input

A TikaTesseract Converter block has these input variables:

base64 (string, required): Base64 encoding of the file.
path (string, required): name or path of the file. It is used only for debugging, logging or auditing purposes; it is not used to "read" the file, whose content is entirely represented by the value of the base64 key.
enableOcr (string, optional): overrides the value of the functional parameter Enable OCR strategy (see below). Allowed values are SMART, ALWAYS and NEVER.
smartOcrThreshold (number, optional): overrides the value of the functional parameter Smart OCR threshold (see below).
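
As an illustration, the input variables above could be assembled in Python as follows. This is a minimal sketch: the helper name build_converter_input is hypothetical, and representing the input as a dictionary that is later serialized to JSON is an assumption about how the block is fed.

```python
import base64

# OCR strategy values allowed for the enableOcr input variable.
ALLOWED_OCR_VALUES = {"SMART", "ALWAYS", "NEVER"}


def build_converter_input(data, path, enable_ocr=None, smart_ocr_threshold=None):
    """Build a dictionary with the block's input variables (hypothetical helper).

    data: raw bytes of the file to convert
    path: file name or path, used only for debugging, logging or auditing
    enable_ocr: optional override of the Enable OCR strategy parameter
    smart_ocr_threshold: optional override of the Smart OCR threshold parameter
    """
    payload = {
        # The file content travels entirely in the base64 key.
        "base64": base64.b64encode(data).decode("ascii"),
        "path": path,
    }
    # Optional variables are added only when an override is requested.
    if enable_ocr is not None:
        if enable_ocr not in ALLOWED_OCR_VALUES:
            raise ValueError(f"enableOcr must be one of {sorted(ALLOWED_OCR_VALUES)}")
        payload["enableOcr"] = enable_ocr
    if smart_ocr_threshold is not None:
        payload["smartOcrThreshold"] = smart_ocr_threshold
    return payload
```

Leaving out enable_ocr and smart_ocr_threshold yields an input with only the two required variables, so the block's own functional parameters apply.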

Block properties

Block properties can be set by editing the block.
TikaTesseract Converter workflow blocks have the following properties:

Basic properties:
- Block name (editable)
- Component version (read only)
- Block ID (read only)
Functional:
- Enable OCR strategy: sets the OCR strategy explained above.
  Possible values are:
  - Smart: the processor first tries to extract text without resorting to OCR. If the length of the extracted text, in characters, is lower than the value of the Smart OCR threshold parameter, extraction is tried again with OCR.
  - Always: text extraction with OCR is always tried.
  - Never: OCR is never used.
- Content characters threshold: maximum number of extracted characters to be returned.
- Smart OCR threshold: when the OCR strategy is Smart (see above), if Tika extracts fewer characters than the value of this parameter, extraction with OCR is tried.
- Fail if conversion result is empty: if turned on, the block returns an error when no text is extracted from the input file.
Deployment:
- Timeout: execution timeout expressed in minutes (m) or seconds (s).
- Replicas: number of required instances.
- Memory: required memory.
- CPU: thousandths of a CPU required (for example: 1000 = 1 CPU).
- Consumer Number: number of threads of the consumer, the software module of the block that takes input from the block's work queue and passes it on for processing.
Input: input properties correspond to the input variables of the component (see above).
Output: read-only, the output manifest of the component.

Output and output-input mapping

The output of a TikaTesseract Converter block is a JSON object with the following structure:

{
    "content": converted text,
    "mime": media type,
    "path": echo of the value of the input variable with the same name   
}

Typically, the content key is mapped to the text input variable of a downstream model block.
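
The mapping can be pictured with a small Python sketch. The function name to_model_input is hypothetical and the sample values are invented for illustration; in a workflow this mapping is configured on the downstream block rather than written as code.

```python
def to_model_input(converter_output):
    """Map the converter's content key to a downstream text input variable.

    converter_output: dictionary with the content, mime and path keys
    described above (hypothetical in-memory representation).
    """
    return {"text": converter_output["content"]}


# Invented sample output, for illustration only.
sample_output = {
    "content": "Quarterly report: revenue grew in all regions.",
    "mime": "application/pdf",
    "path": "reports/q1.pdf",
}

model_input = to_model_input(sample_output)
```

Here only content is forwarded; mime and path remain available in the converter's output for other mappings or for auditing.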