Similarity Calculator

Description

A block of the Similarity Calculator component queries an Elasticsearch index prepared to manage a similarity use case.
Its main purpose is to find, among the indexed documents, those that are similar to a given one, but it can also be used for generic searches in the index and to retrieve the indexed data of a given document.

Input

A Similarity Calculator block has these input variables:

  • documentId (string, required): indexed ID of the pivot document. It is used if mode is SIMILARITY or RETRIEVE and ignored when mode is SEARCH.
  • esearch (object): Query DSL expression for query context. It is used to find documents when mode is SEARCH and is ignored otherwise.
  • efilters (array): Query DSL expression for filter context. Used when mode is SIMILARITY to filter similar documents: only documents with metadata matching this filter are returned. It is ignored if filters is specified.
  • filters (object): when mode is SIMILARITY, it's an optional filter for similar documents: only documents with matching metadata are returned. efilters is ignored if filters is specified.
    If used, the object must have one property for each metadata that must be matched. Each property acts as a filter condition: its name is the name of the metadata, and its value is an array whose items are the values to match in an OR combination, so any matched value satisfies the filter condition.
    For example, if documents represent novels and they have an author metadata, this value of filters:

    "filters": {
        "author": ["John Ronald Reuel Tolkien", "Clive Staples Lewis"]
    }
    

    will keep only the documents with an instance of metadata author set to John Ronald Reuel Tolkien or Clive Staples Lewis.

    If multiple filter conditions are specified, they are combined with AND.

  • maxDocuments (integer): maximum number of documents to return.

  • minScore (number): minimum score that a document must have to be returned.
  • mode (string, required): query mode chosen between:
    • SIMILARITY: get documents similar to the document with ID equal to documentId.
    • SEARCH: find indexed documents that match esearch, ignoring documentId.
    • RETRIEVE: get indexed metadata for the document with ID equal to documentId.
  • outputFields (array): optional metadata to be returned for each document. Every item of the array is the name of a metadata.
  • similarityIndex (string, required): identifier of the index to query. The actual name of the Elasticsearch index is the value of similarityIndex prefixed by a constant string.
  • spk (object): override of the SPK functional property. Reserved for system workflows, must be omitted.
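
The OR/AND semantics of filters described above can be sketched in a few lines of Python. This is only an illustration of the matching rule, not the component's actual implementation (the block delegates filtering to Elasticsearch). The language metadata, the index name novels and the document ID doc-42 are hypothetical values introduced for the example.

```python
from typing import Dict, List


def matches_filters(metadata: Dict[str, List[str]],
                    filters: Dict[str, List[str]]) -> bool:
    """Return True when the metadata satisfies every filter condition."""
    for name, accepted in filters.items():
        values = metadata.get(name, [])
        # OR inside a condition: one matching value is enough.
        if not any(value in accepted for value in values):
            return False
    # AND across conditions: reaching this point means all were satisfied.
    return True


# "language" is a hypothetical second metadata, added to show the AND combination.
filters = {
    "author": ["John Ronald Reuel Tolkien", "Clive Staples Lewis"],
    "language": ["en"],
}

novel_a = {"author": ["Clive Staples Lewis"], "language": ["en"]}
novel_b = {"author": ["Clive Staples Lewis"], "language": ["it"]}

print(matches_filters(novel_a, filters))  # both conditions satisfied
print(matches_filters(novel_b, filters))  # author matches, language does not

# Input variables for a SIMILARITY query using the same filters.
# The index name and the document ID are hypothetical.
similarity_input = {
    "mode": "SIMILARITY",
    "similarityIndex": "novels",
    "documentId": "doc-42",
    "filters": filters,
    "maxDocuments": 10,
    "minScore": 0.5,
    "outputFields": ["author"],
}
```

With these sample values, novel_a passes the filter and novel_b is discarded because its language condition fails, even though its author matches.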

Block properties

Block properties can be set by editing the block.
Similarity Calculator workflow blocks have the following properties:

  • Basic properties:

    • Block name (editable)
    • Component version (read only)
    • Block ID (read only)
  • Functional:

    • Elasticsearch Timeout: maximum time, in milliseconds, that the block waits for a response from Elasticsearch before raising an error.
    • SPK: similarity model. It is used when input variable mode is SIMILARITY to tell Elasticsearch which metadata to compare to find similar documents and, optionally, which boost to give to the portion of the similarity score generated by the match of a given metadata.
  • Deployment:

    • Timeout: execution timeout expressed in minutes (m) or seconds (s).
    • Replicas: number of required instances.
    • Memory: required memory.
    • CPU: thousandths of a CPU required (for example: 1000 = 1 CPU).
    • Consumer Number: number of threads of the consumer, the software module of the block that provides input to process by taking it from the block's work queue.
  • Input: input properties correspond to the input variables of the component (see above).

  • Output: read-only, the output manifest of the component.

Output

The output of a Similarity Calculator block is a JSON object with the following structure:

{
    "documents": [
        {
            "fields": {},
            "id": "abc",
            "score": 123,
            "scoredFields": {}
        },
        ...
    ],
    "maxScore": 123
}

where:

  • documents (array): data of the returned documents, one item per document. These are the documents similar to the one identified by input variable documentId if input variable mode is SIMILARITY, the indexed documents matching input variable esearch if mode is SEARCH, or the single document identified by documentId if mode is RETRIEVE.
    Each item has these properties:

    • fields (object): document metadata corresponding to the outputFields input variable. The object has one property for each metadata listed in outputFields that doesn't have a score. The property is an array whose items are the values of the metadata.
    • id (string): document ID.
    • score (number): document score. It is the similarity score if input variable mode is SIMILARITY, constant value 1 if mode is SEARCH and constant value 0 if mode is RETRIEVE.
    • scoredFields (object): like fields, but only for metadata listed in outputFields that have a score. This kind of score is not computed by Elasticsearch: it is an attribute of the metadata value and is indexed together with the value. The value of scored metadata has this format:

      value|score

    where value is the value of the metadata and score is the score of the value. Examples of scored fields are main lemmas and extractions.

  • maxScore (number): highest value of score in the items of documents.
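
A consumer of this output might split the value|score strings and check maxScore against the document scores, as in the minimal Python sketch below. The function names, the mainLemmas metadata name and all sample values are hypothetical. Splitting on the last | is an assumption made so that a value which itself contains | stays intact; the source does not say how such values are encoded.

```python
from typing import Dict, List, Optional, Tuple


def parse_scored_value(raw: str) -> Tuple[str, float]:
    """Split a 'value|score' string into its value and numeric score."""
    if "|" not in raw:
        raise ValueError(f"not a scored value: {raw!r}")
    # Assumption: the score follows the LAST "|" in the string.
    value, _, score = raw.rpartition("|")
    return value, float(score)


def parse_scored_fields(document: dict) -> Dict[str, List[Tuple[str, float]]]:
    """Parse every scored metadata of one item of 'documents'."""
    return {
        name: [parse_scored_value(item) for item in items]
        for name, items in document.get("scoredFields", {}).items()
    }


def highest_score(output: dict) -> Optional[float]:
    """Highest 'score' among the returned documents, or None if there are none."""
    return max((doc["score"] for doc in output.get("documents", [])), default=None)


# Hypothetical block output with two similar documents.
sample_output = {
    "documents": [
        {
            "fields": {"author": ["Clive Staples Lewis"]},
            "id": "doc-7",
            "score": 7.5,
            "scoredFields": {"mainLemmas": ["lion|12.5", "wardrobe|8"]},
        },
        {
            "fields": {"author": ["John Ronald Reuel Tolkien"]},
            "id": "doc-9",
            "score": 3.2,
            "scoredFields": {},
        },
    ],
    "maxScore": 7.5,
}

for doc in sample_output["documents"]:
    print(doc["id"], doc["score"], parse_scored_fields(doc))
```

In this sample, highest_score(sample_output) equals the maxScore property, which matches the definition of maxScore given above.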