TikaTesseract Converter processor

Description

The TikaTesseract Converter processor tries to extract plain text from a file.

It's a combination of Apache Tika and Tesseract OCR.

If the file format is recognized as one of those supported by Tika and the OCR strategy (see below) is Smart or Never, the file is processed by Tika.

If the file format is one of these:

  • PDF
  • PNG
  • JPEG
  • TIFF
  • GIF
  • WebP
  • BMP
  • PNM

and the OCR strategy is Always, the file is instead processed by Tesseract OCR.

If the file has been processed by Tika, the OCR strategy is Smart, and Tika has extracted no text or less text than the Smart OCR threshold (see below), the file is then submitted to Tesseract OCR, provided it is in one of the formats above.
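The selection rules above can be sketched as two small functions. This is only an illustration of the described behavior, not the processor's actual implementation: the function names, the upper-case strategy strings, and the `tika_supported` flag are assumptions introduced here, and cases the description does not cover (for example, Always with a format outside the list) return `None`.

```python
# Formats that the description lists as eligible for Tesseract OCR.
OCR_FORMATS = {"PDF", "PNG", "JPEG", "TIFF", "GIF", "WEBP", "BMP", "PNM"}


def first_engine(strategy, file_format, tika_supported):
    """Return the engine that handles the file first (hypothetical helper)."""
    fmt = file_format.upper()
    if strategy == "ALWAYS" and fmt in OCR_FORMATS:
        return "tesseract"
    if strategy in ("SMART", "NEVER") and tika_supported:
        return "tika"
    # Not covered by the description above.
    return None


def needs_ocr_fallback(strategy, file_format, extracted_chars, smart_threshold):
    """Return True when a Tika result should be retried with OCR (hypothetical helper)."""
    if strategy != "SMART":
        return False
    if file_format.upper() not in OCR_FORMATS:
        return False
    # Tika extracted nothing, or fewer characters than the Smart OCR threshold.
    return extracted_chars == 0 or extracted_chars < smart_threshold
```

For example, under these assumptions a PDF processed with the Smart strategy goes to Tika first, and is resubmitted to OCR only if Tika returns fewer characters than the threshold.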

Old version

There are two versions of the component available: the latest and 1.0.0. Version 1.0.0 is kept for backward compatibility with workflows created with previous versions of NL Flow. For new workflows, always use the latest version.

The blocks of version 1.0.0 are identical to those of version 1.9 of NL Flow, except for two changes in the block properties:

  • The name of the block is modified by selecting Edit component name instead of editing the Block name field.
  • The Output tab has been added (see below).

All other block properties remain unchanged, so if you are dealing with a version 1.0.0 block, refer to the NL Flow version 1.9 documentation and ignore the rest of this page, except for the Output tab.

Input

The processor requires the input JSON to contain this top-level key:

"base64": "base64Encoding"

where base64Encoding is the Base64 encoding of the file.
Optional keys are:

  • filePath (string): the name or the path of the file. It is used only for logging or auditing purposes, not to read the file, whose content is entirely represented by the value of the base64 key.
  • enableOcr (string): overrides the value of the functional parameter Enable OCR strategy (see below). Allowed values are SMART, ALWAYS and NEVER.
  • smartOcrThreshold (number): overrides the value of the functional parameter Smart OCR threshold (see below).
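As a minimal sketch of how such an input could be assembled, the following builds the JSON object from raw file bytes. The key names come from the description above; the function name `build_input` and the validation of `enableOcr` on the client side are assumptions made for illustration.

```python
import base64
import json

ALLOWED_OCR_VALUES = ("SMART", "ALWAYS", "NEVER")


def build_input(file_bytes, file_path=None, enable_ocr=None, smart_ocr_threshold=None):
    """Build the input JSON object for the processor (hypothetical helper)."""
    # The file itself travels entirely in the required "base64" key.
    payload = {"base64": base64.b64encode(file_bytes).decode("ascii")}
    if file_path is not None:
        # Informational only: used for logging or auditing.
        payload["filePath"] = file_path
    if enable_ocr is not None:
        if enable_ocr not in ALLOWED_OCR_VALUES:
            raise ValueError(f"enableOcr must be one of {ALLOWED_OCR_VALUES}")
        payload["enableOcr"] = enable_ocr
    if smart_ocr_threshold is not None:
        payload["smartOcrThreshold"] = smart_ocr_threshold
    return payload


if __name__ == "__main__":
    example = build_input(b"hello", file_path="hello.txt", enable_ocr="SMART")
    print(json.dumps(example, indent=4))
```

Optional keys are left out entirely when not set, so the block falls back to its functional parameters.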

Block properties

Block properties can be set by editing the block.
TikaTesseract Converter workflow blocks have the following properties:

  • Basic properties:

    • Block name (editable)
    • Component version (read only)
    • Block ID (read only)
  • Functional:

    • Enable OCR strategy: set the OCR strategy explained above.
      Possible values are:

      • Smart: the processor first tries to extract text without resorting to OCR. If the size of the extracted text, in characters, is lower than the value of the Smart OCR threshold parameter, extraction is tried again with OCR.
      • Always: text extraction with OCR is always tried.
      • Never: OCR is never used.
    • Content characters threshold: maximum number of extracted characters to be returned.

    • Smart OCR threshold: when the OCR strategy is Smart (see above), if Tika extracts fewer characters than the value of this parameter, extraction with OCR is tried.
    • Fail if conversion result is empty: if turned on, the block returns an error when no text is extracted from the input file.
  • Deployment:

    • Timeout: execution timeout expressed in minutes (m) or seconds (s).
    • Replicas: number of required instances.
    • Memory: required memory.
    • CPU: thousandths of a CPU required (for example: 1000 = 1 CPU).
    • Consumer Number: number of threads of the consumer, the software module of the block that provides input to process by taking it from the block's work queue.
  • Input

    These properties correspond to the top-level keys of the input JSON.
    They need to be set only when input mapping is necessary.

  • Output: read-only, this property is a navigable description of the structure of the output array.
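The combined effect of the Content characters threshold and Fail if conversion result is empty parameters can be pictured with the sketch below. It rests on assumptions: the function name is hypothetical, the text is assumed to be cut to its first characters, and "no text" is assumed to mean an empty string. The page does not state how the block truncates or what error it raises.

```python
def finalize_text(text, content_chars_threshold, fail_if_empty):
    """Apply the two post-extraction parameters (hypothetical helper)."""
    if fail_if_empty and not text:
        # Assumed error type; the block's real error is not documented here.
        raise ValueError("no text was extracted from the input file")
    # Assumption: at most `content_chars_threshold` leading characters are returned.
    return text[:content_chars_threshold]
```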

Output and output-input mapping

The output of a TikaTesseract Converter block is a JSON object with the following structure:

{
    "content": converted text,
    "mime": media type,
    "path": file name or path    
}

Typically, the content key is mapped to the text input property of a downstream model block.
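Outside the workflow editor, the same mapping can be pictured as follows. This is a sketch only: the function name and the sample values are invented for illustration, and the downstream input is reduced to a single text key.

```python
import json


def to_model_input(converter_output_json):
    """Map the converter's content key to a downstream text key (hypothetical helper)."""
    output = json.loads(converter_output_json)
    return {"text": output["content"]}


if __name__ == "__main__":
    # Sample values are made up for the example.
    sample = '{"content": "Hello world", "mime": "application/pdf", "path": "hello.pdf"}'
    print(to_model_input(sample))
```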