Similarity Indexer

Description

A Similarity Indexer block indexes input data in an Elasticsearch index as part of a workflow that supports the similarity use case. In a typical workflow that indexes the outcome of a predictive model, a Similarity Indexer block is preceded by a Similarity Document Preparator block.

Input

A Similarity Indexer block has these input variables:

  • documentId (string): ID of the document to index. Can be omitted if the user wants the document ID to be generated by Elasticsearch.
  • fields (object, required): fields to index, with their values.
  • similarityId (string, required): index identifier. This value is the rightmost, identifying part of the actual index name, which is the concatenation of a constant prefix, a dash and the value of similarityId. For example, if similarityId is astro8, the full name of the index is:

    prefix-astro8
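
As a minimal sketch of the naming rule and of the input variables listed above, the Python snippet below builds the full index name and an example input object. The value `prefix` is only the placeholder used in the example above, and the helper name `full_index_name` and the keys inside `fields` are hypothetical, chosen for illustration.

```python
# Placeholder taken from the example above; the actual constant prefix is not stated here.
INDEX_PREFIX = "prefix"


def full_index_name(similarity_id: str, prefix: str = INDEX_PREFIX) -> str:
    """Concatenate the constant prefix, a dash and the similarityId value."""
    if not similarity_id:
        raise ValueError("similarityId is required")
    return f"{prefix}-{similarity_id}"


# Example input for a block; the keys inside "fields" are made up for illustration.
block_input = {
    "documentId": "doc-001",  # optional: omit it to let Elasticsearch generate the ID
    "fields": {"title": "Orion nebula", "category": "astronomy"},
    "similarityId": "astro8",
}

print(full_index_name(block_input["similarityId"]))  # prefix-astro8
```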

Block properties

Block properties can be set by editing the block.
Similarity Indexer workflow blocks have the following properties:

  • Basic properties:

    • Block name (editable)
    • Component version (read only)
    • Block ID (read only)
  • Functional:

    • Elasticsearch Timeout: maximum time, in milliseconds, that the block waits for a response from Elasticsearch before returning an error.
    • Elasticsearch Verify Index (on/off): when on, the block fails if the index specified by the similarityId input variable doesn't exist. When off, if the index doesn't exist it is created on the fly and then the document is indexed.
  • Deployment:

    • Timeout: execution timeout expressed in minutes (m) or seconds (s).
    • Replicas: number of required instances.
    • Memory: required memory.
    • CPU: thousandths of a CPU required (for example: 1000 = 1 CPU).
    • Consumer Number: number of threads of the consumer, the software module of the block that takes input from the block's work queue and provides it for processing.
    • Batch Size: reserved for system workflows; do not change the default value.
    • Batch Timeout in ms: reserved for system workflows; do not change the default value.
  • Input: input properties correspond to the input variables of the component (see above).

  • Output: read-only, the output manifest of the component.
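
The behavior of the Elasticsearch Verify Index property described above can be sketched in Python as follows. This is an in-memory illustration only, not the component's implementation: the set `existing_indices` stands in for the indices present in Elasticsearch, and the names `ensure_index` and `IndexNotFoundError` are hypothetical.

```python
class IndexNotFoundError(Exception):
    """Raised when Verify Index is on and the target index does not exist."""


def ensure_index(existing_indices: set, index_name: str, verify_index: bool) -> bool:
    """Return True if the index had to be created, False if it already existed."""
    if index_name in existing_indices:
        return False
    if verify_index:
        # Verify Index on: a missing index makes the block fail.
        raise IndexNotFoundError(f"index {index_name!r} does not exist")
    # Verify Index off: the index is created on the fly before indexing.
    existing_indices.add(index_name)
    return True


indices = {"prefix-astro8"}
ensure_index(indices, "prefix-astro8", verify_index=True)   # index exists, nothing to do
ensure_index(indices, "prefix-astro9", verify_index=False)  # index is created on the fly
```

With Verify Index on, a mistyped similarityId surfaces as an error instead of silently creating a new, empty index.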

Output

The output of a Similarity Indexer block is a JSON object with this structure:

{
    "documentId": "abc"
}

where documentId is the ID of the indexed document. It coincides with the value of the input variable with the same name, if that was specified; otherwise it is a value generated by Elasticsearch. In the latter case it is essential that the workflow user stores this value, so that it can later be used in a workflow that uses Similarity Calculator to find similar documents.
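
One possible way to keep track of the returned ID is sketched below in Python. The function name `remember_document_id`, the `record_key` argument and the `registry` dictionary are all hypothetical; in a real setup the dictionary would be replaced by whatever persistent storage the workflow user has available.

```python
import json


def remember_document_id(block_output: str, record_key: str, registry: dict) -> str:
    """Parse the block output and store its documentId under a caller-chosen key."""
    document_id = json.loads(block_output)["documentId"]
    registry[record_key] = document_id
    return document_id


# The dict stands in for persistent storage (for example, a database table).
registry = {}
remember_document_id('{"documentId": "abc"}', "customer-42", registry)

# Later, the stored ID can be looked up and passed to a workflow
# that uses Similarity Calculator.
stored_id = registry["customer-42"]
```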