Tika Converter processor

Description

The Tika Converter processor extracts plain text from a binary file.
It is based on Apache Tika; see the Apache Tika documentation for the list of supported file formats.

Input

The processor requires the input JSON to contain these top-level keys:

"path": "filePath",
"base64": "base64Encoding"

where:

  • filePath is the name of the file to process or its path.
  • base64Encoding is the Base64 encoding of the file.

The value of path is used only for logging and auditing purposes. It is not used to read the file: the file content is represented entirely by the base64 value.
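
As an illustration of the input format described above, the following minimal Python sketch builds the input JSON from the raw bytes of a file. The function name build_tika_input and the sample path docs/report.pdf are hypothetical and are not part of the product; only the path and base64 keys come from this page.

```python
import base64
import json
from pathlib import Path
from typing import Optional


def build_tika_input(path: str, data: Optional[bytes] = None) -> dict:
    """Return a dict with the two top-level keys the processor expects.

    Hypothetical helper: if `data` is not given, the bytes are read from `path`.
    """
    if data is None:
        data = Path(path).read_bytes()
    return {
        # Used only for logging and auditing, not to read the file.
        "path": path,
        # The complete file content, Base64-encoded as ASCII text.
        "base64": base64.b64encode(data).decode("ascii"),
    }


# Illustrative call with in-memory bytes instead of a real file.
payload = build_tika_input("docs/report.pdf", b"hello")
print(json.dumps(payload, indent=2))
```

Because the file is carried entirely in the base64 value, the path does not need to be resolvable by the processor.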

Block properties

Block properties can be set by editing the block.
Tika Converter workflow blocks have the following properties:

  • Common:

    • The unique block ID and the service version, displayed in the title bar (read-only; also displayed in the block tooltip in the canvas).
    • Block name: the name of the block (editable).
    • Description: the description of the processor (read-only).
  • Type Specific:

    • Timeout: execution timeout expressed in minutes (m) or seconds (s).
  • Deployment:

    • Replicas: number of required instances.
    • Memory: required memory.
    • CPU: thousandths of a CPU required (for example: 1000 = 1 CPU).
  • Input

    Used for input mapping: one property for each of the top-level keys of the input JSON.
    If:

    • The block is the first in a flow and the workflow input contains only the expected keys.

    Or:

    these properties do not need to be set.
    Otherwise, the properties determine which top-level keys of the overall "upstream JSON" are mapped to the block's input keys. Set the value of each property by choosing from the compatible keys of the upstream blocks' output or, if the input format of the workflow has been defined, from the keys of the $nlflow_input pseudo block.
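
Conceptually, input mapping amounts to selecting and renaming keys. The Python sketch below only illustrates that idea; it is not how the platform implements mapping. The function apply_input_mapping and the upstream key names fileName and fileData are hypothetical.

```python
def apply_input_mapping(upstream: dict, mapping: dict) -> dict:
    """Build the block input by picking upstream keys.

    `mapping` goes from block input key to upstream key (hypothetical shape).
    """
    return {block_key: upstream[upstream_key]
            for block_key, upstream_key in mapping.items()}


# Hypothetical upstream JSON whose keys do not match the expected ones.
upstream_json = {
    "fileName": "report.pdf",
    "fileData": "aGVsbG8=",
    "requestId": 42,  # extra key, ignored by the mapping
}

# One entry per top-level key of the block's input JSON.
input_mapping = {"path": "fileName", "base64": "fileData"}

block_input = apply_input_mapping(upstream_json, input_mapping)
print(block_input)
```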

Output and output-input mapping

The output of a Tika Converter block is a JSON object with the following structure:

{
    "content": converted text,
    "mime": media type,
    "path": file name or path    
}

Typically, the content key is mapped to the text input property of model blocks.
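
The relationship between the output keys and a downstream model block can be pictured with the following Python sketch. The sample values and the function to_model_input are illustrative assumptions; in a real workflow the mapping is configured through the block properties rather than written in code.

```python
def to_model_input(tika_output: dict) -> dict:
    """Map the converter's content key to the text key of a model block's input."""
    return {"text": tika_output["content"]}


# Illustrative output with made-up values for the three documented keys.
sample_output = {
    "content": "Quarterly report: revenue grew in all regions.",
    "mime": "application/pdf",
    "path": "report.pdf",
}

model_input = to_model_input(sample_output)
print(model_input["text"])
```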