Similarity Calculator
Description
A block of the Similarity Calculator component queries an Elasticsearch index prepared to manage a similarity use case.
Its main purpose is to find, among indexed documents, those that are similar to a given one, but it can also be used for generic searches in the index and to retrieve the indexed data of a given document.
Input
A Similarity Calculator block has these input variables:
- documentId (string, required): indexed ID of the pivot document. It is considered if mode is SIMILARITY or RETRIEVE and ignored when mode is SEARCH.
- esearch (object): Query DSL expression for query context. It is used to find documents when mode is SEARCH and is ignored otherwise.
- efilters (array): Query DSL expression for filter context. It is used when mode is SIMILARITY to filter similar documents: only documents with metadata matching this filter are returned. It is ignored if filters is specified.
- filters (object): when mode is SIMILARITY, it is an optional filter for similar documents: only documents with matching metadata are returned. If filters is specified, efilters is ignored.
  If used, the object must have one property for each metadata that must be matched. Each property acts as a filter condition: its name is the name of the metadata and its value is an array of the values to match, combined in OR, so any matched value satisfies the condition.
  For example, if documents represent novels and they have an author metadata, this value of filters:
  "filters": { "author": ["John Ronald Reuel Tolkien", "Clive Staples Lewis"] }
  will keep only the documents with an instance of the author metadata set to John Ronald Reuel Tolkien or Clive Staples Lewis.
  If more than one filter condition is specified, the conditions are combined in AND.
- maxDocuments (integer): maximum number of documents to return.
- minScore (number): minimum score that a document must have to be returned.
- mode (string, required): query mode, one of:
  - SIMILARITY: get documents similar to the document with ID equal to documentId.
  - SEARCH: find indexed documents that match esearch, ignoring documentId.
  - RETRIEVE: get indexed metadata for the document with ID equal to documentId.
- outputFields (array): optional metadata to be returned for each document. Every item of the array is the name of a metadata.
- similarityIndex (string, required): identifier of the index to query. The actual name of the Elasticsearch index is the value of similarityIndex prefixed by a constant string.
- spk (object): override of the SPK functional property. Reserved for system workflows, must be omitted.
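The way filters combines conditions (OR between the values of one metadata, AND between different metadata) can be illustrated with a small local sketch. This is not the block's implementation: the matches_filters helper, the index identifier novels, the document ID doc-001, and the title and language metadata are hypothetical names chosen for illustration, and exact string equality is assumed for value matching.

```python
def matches_filters(metadata: dict[str, list[str]], filters: dict[str, list[str]]) -> bool:
    """Tell whether a document's metadata satisfies a filters object.

    Values listed for one metadata are alternatives (OR);
    conditions on different metadata must all hold (AND).
    Assumption: values are compared by exact string equality.
    """
    return all(
        any(value in metadata.get(name, []) for value in accepted)
        for name, accepted in filters.items()
    )


# A possible input object for SIMILARITY mode (identifiers are hypothetical).
similarity_input = {
    "mode": "SIMILARITY",
    "similarityIndex": "novels",   # hypothetical index identifier
    "documentId": "doc-001",       # hypothetical pivot document ID
    "filters": {
        "author": ["John Ronald Reuel Tolkien", "Clive Staples Lewis"],
    },
    "maxDocuments": 10,
    "minScore": 0.5,
    "outputFields": ["author", "title"],
}

# Metadata of a hypothetical indexed novel.
novel = {"author": ["Clive Staples Lewis"], "language": ["en"]}

print(matches_filters(novel, similarity_input["filters"]))  # True

# Adding a second condition narrows the match, because conditions are ANDed.
stricter = {**similarity_input["filters"], "language": ["it"]}
print(matches_filters(novel, stricter))  # False
```

The spk variable is left out of the input object on purpose, since it is reserved for system workflows.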
Block properties
Block properties can be set by editing the block.
Similarity Calculator workflow blocks have the following properties:
-
Basic properties:
- Block name, it can be edited
- Component version (read only)
- Block ID (read only)
-
Functional:
- Elasticsearch Timeout: maximum time, in milliseconds, the block waits for a response from Elasticsearch before returning an error.
- SPK: similarity model. It is used when input variable mode is SIMILARITY to tell Elasticsearch which metadata to compare to find similar documents and, optionally, which boost to give to the portion of the similarity score generated by the match of given metadata.
-
Deployment:
- Timeout: execution timeout expressed in minutes (m) or seconds (s).
- Replicas: number of required instances.
- Memory: required memory.
- CPU: thousandths of a CPU required (for example: 1000 = 1 CPU).
- Consumer Number: number of threads of the consumer, the software module of the block that provides input to process by taking it from the block's work queue.
-
Input: input properties correspond to the input variables of the component (see above).
-
Output: read-only, the output manifest of the component.
Output
The output of a Similarity Calculator block is a JSON object with the following structure:
{
"documents": [
{
"fields": {},
"id": "abc",
"score": 123,
"scoredFields": {}
},
...
],
"maxScore": 123
}
where:
- documents (array): documents' data, one item for each returned document. They are documents similar to the document with ID equal to input variable documentId if input variable mode is SIMILARITY, indexed documents matching input variable esearch if mode is SEARCH, or the document with ID equal to input variable documentId if mode is RETRIEVE.
  Each item has these properties:
  - fields (object): document metadata corresponding to the outputFields input variable. The object has one property for each metadata listed in outputFields that doesn't have a score. The property is an array whose items are the values of the metadata.
  - id (string): document ID.
  - score (number): document score. It is the similarity score if input variable mode is SIMILARITY, constant value 1 if mode is SEARCH and constant value 0 if mode is RETRIEVE.
  - scoredFields (object): like fields, but only for metadata listed in outputFields that have a score. This kind of score is not computed by Elasticsearch: it is an attribute of the value of the metadata that gets indexed together with the value. The value of scored metadata has this format:
    value|score
    where value is the value of the metadata and score is the score of the value. Examples of scored fields are main lemmas and extractions.
- maxScore (number): highest value of score in the items of documents.
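As a minimal sketch of how a consumer might read this output, the snippet below splits scored metadata values into their value and score parts. The sample data, the mainLemmas property name and the helper names are hypothetical. It also assumes that the score part is numeric and that the last pipe character in the string is the separator, which the description above does not state explicitly.

```python
import json


def parse_scored_value(raw: str) -> tuple[str, float]:
    """Split a scored metadata value of the form value|score.

    Assumption: the separator is the last pipe, so a pipe that is
    part of the value itself is preserved.
    """
    value, _, score = raw.rpartition("|")
    return value, float(score)


def summarize(output: dict) -> list[dict]:
    """Collect ID, score and parsed scored fields for each returned document."""
    rows = []
    for doc in output["documents"]:
        scored = {
            name: [parse_scored_value(item) for item in values]
            for name, values in doc.get("scoredFields", {}).items()
        }
        rows.append({"id": doc["id"], "score": doc["score"], "scored": scored})
    return rows


# Hypothetical output of a block run in SIMILARITY mode.
sample_output = json.loads("""
{
  "documents": [
    {
      "fields": {"author": ["John Ronald Reuel Tolkien"]},
      "id": "abc",
      "score": 7.5,
      "scoredFields": {"mainLemmas": ["hobbit|12.5", "ring|9"]}
    }
  ],
  "maxScore": 7.5
}
""")

rows = summarize(sample_output)
print(rows[0]["scored"]["mainLemmas"])  # [('hobbit', 12.5), ('ring', 9.0)]
```

When mode is SEARCH or RETRIEVE the score of each document is a constant (1 or 0 respectively), so ranking by score is only meaningful in SIMILARITY mode.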