Tika Converter processor
Description
The Tika Converter processor extracts plain text from a binary file.
It is based on Apache Tika; see the Apache Tika documentation for the list of supported file formats.
Input
The processor requires the input JSON to contain these top-level keys:
"path": "filePath",
"base64": "base64Encoding"
where:
filePath is the name of the file to process or its path.
base64Encoding is the Base64 encoding of the file.
The value of path is used only for logging or auditing purposes. It is not used to read the file: the file content is represented entirely by the base64 value.
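As an illustration, the input JSON described above could be assembled as in the following minimal sketch. The function name build_tika_input is hypothetical and not part of the product; only the "path" and "base64" keys come from this page.

```python
import base64
import json
from pathlib import Path


def build_tika_input(file_path: str) -> str:
    """Return a JSON string with the two top-level keys the processor expects.

    build_tika_input is a hypothetical helper used only for illustration.
    """
    # The file is read locally and shipped entirely inside the "base64" key.
    raw_bytes = Path(file_path).read_bytes()
    payload = {
        # "path" is informational only (logging or auditing).
        "path": file_path,
        # Base64 output is ASCII, so it can be stored as a JSON string.
        "base64": base64.b64encode(raw_bytes).decode("ascii"),
    }
    return json.dumps(payload)
```

Because the processor never opens the file itself, the path value can be any name that is meaningful in logs, as long as the base64 value carries the actual file content.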
Block properties
Block properties can be set by editing the block.
Tika Converter workflow blocks have the following properties:
- Common:
  - The unique block ID and the service version, displayed in the title bar (read only, also displayed in the block tooltip in the canvas).
  - Block name: the block name, which can be edited.
  - Description: the description of the processor (read only).
- Type Specific:
  - Timeout: execution timeout expressed in minutes (m) or seconds (s).
- Deployment:
  - Replicas: number of required instances.
  - Memory: required memory.
  - CPU: thousandths of a CPU required (for example: 1000 = 1 CPU).
- Input:
  Used for input mapping: one property for each of the top-level keys of the input JSON.
These properties do not need to be set if either of the following is true:
- The block is the first in a flow and the workflow input contains only the expected keys.
- The previous block's output contains only the expected keys.
Otherwise, the properties determine which top-level keys of the overall "upstream JSON" must be mapped to the block's input keys. The value of each property must be chosen from the compatible keys of the upstream blocks' output or, if the input format of the workflow has been defined, from the keys of the $nlflow_input pseudo block.
Output and output-input mapping
The output of a Tika Converter block is a JSON object with the following structure:
{
"content": converted text,
"mime": media type,
"path": file name or path
}
Typically, the content key is mapped to the text input property of model blocks.
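The following sketch only illustrates the data flow of that mapping. In a workflow the mapping is configured through the input properties of the downstream block, not in code, and the function name to_model_input is hypothetical.

```python
import json


def to_model_input(tika_output_json: str) -> dict:
    """Map the converter's "content" key to a model block's "text" key.

    to_model_input is a hypothetical helper used only for illustration.
    """
    output = json.loads(tika_output_json)
    # "mime" and "path" are also available in the output but are not
    # forwarded here, since the model block is assumed to need only the text.
    return {"text": output["content"]}
```

For example, an output whose content key holds the converted text of a PDF would produce a dictionary with a single text key holding that same string.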