Similarity Calculator
Description
A block of the Similarity Calculator component queries an Elasticsearch index prepared to manage a similarity use case.
Its main purpose is to find, among the indexed documents, those that are similar to a given one, but it can also be used for generic searches in the index and to retrieve the indexed data of a given document.
Input
A Similarity Calculator block has these input variables:
- documentId (string, required): indexed ID of the pivot document. It is considered if mode is SIMILARITY or RETRIEVE and ignored when mode is SEARCH.
- esearch (object): Query DSL expression for query context. It is used to find documents when mode is SEARCH and is ignored otherwise.
- efilters (array): Query DSL expression for filter context. It is used when mode is SIMILARITY to filter similar documents: only documents with metadata matching this filter are returned. It is ignored if filters is specified.
- filters (object): when mode is SIMILARITY, an optional filter for similar documents: only documents with matching metadata are returned. efilters is ignored if filters is specified.
  If used, the object must have one property for each metadata field that must be matched. Each property acts as a filter condition: its name is the name of the metadata field and its value is an array of the values to match, combined in OR, so any matched value satisfies the condition.
  For example, if documents represent novels and they have an author metadata field, this value of filters:
    "filters": { "author": ["John Ronald Reuel Tolkien", "Clive Staples Lewis"] }
  keeps only the documents with an instance of the author metadata set to John Ronald Reuel Tolkien or Clive Staples Lewis.
  If more than one filter condition is specified, the conditions are combined in AND.
- maxDocuments (integer): maximum number of documents to return.
- minScore (number): minimum score that a document must have to be returned.
- mode (string, required): query mode, chosen among:
  - SIMILARITY: get documents similar to the document with ID equal to documentId.
  - SEARCH: find indexed documents that match esearch, ignoring documentId.
  - RETRIEVE: get indexed metadata for the document with ID equal to documentId.
- outputFields (array): optional metadata to be returned for each document. Every item of the array is the name of a metadata field.
- similarityIndex (string, required): identifier of the index to query. The actual name of the Elasticsearch index is the value of similarityIndex prefixed by a constant string.
- spk (object): override of the SPK functional property. Reserved for system workflows; it must be omitted.
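The OR/AND semantics of filters described above can be illustrated with a small Python sketch. It only models the rule and is not how the block or Elasticsearch implements it: the function name matches_filters, the representation of a document's metadata as a dictionary of value lists, the exact-string comparison, the genre field, and the sample novels are all assumptions made for illustration.

```python
from typing import Dict, List


def matches_filters(metadata: Dict[str, List[str]],
                    filters: Dict[str, List[str]]) -> bool:
    """Return True if the metadata satisfies every filter condition.

    Hypothetical helper: exact string equality is an assumption; the real
    matching is performed by Elasticsearch.
    """
    for name, accepted in filters.items():
        values = metadata.get(name, [])
        # OR inside one condition: one matching value is enough.
        if not any(value in accepted for value in values):
            # AND across conditions: a single failed condition rejects the document.
            return False
    return True


# Filter taken from the example in the text.
filters = {"author": ["John Ronald Reuel Tolkien", "Clive Staples Lewis"]}

# Made-up documents, keyed by a made-up ID.
novels = {
    "n1": {"author": ["John Ronald Reuel Tolkien"], "genre": ["fantasy"]},
    "n2": {"author": ["Jane Austen"], "genre": ["romance"]},
    "n3": {"author": ["Clive Staples Lewis"], "genre": ["fantasy"]},
}

kept = [doc_id for doc_id, meta in novels.items() if matches_filters(meta, filters)]
print(kept)
```

With the sample data, only the two novels whose author is one of the listed values are kept. Adding a second property to filters would narrow the result further, because every condition has to be satisfied.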
Block properties
Block properties can be set by editing the block.
Similarity Calculator workflow blocks have the following properties:
- Basic properties:
  - Block name, it can be edited
  - Component version (read only)
  - Block ID (read only)
- Functional:
  - Elasticsearch Timeout: maximum time, in milliseconds, that the block waits for a response from Elasticsearch before returning an error.
  - SPK: similarity model. It is used when the input variable mode is SIMILARITY to tell Elasticsearch which metadata to compare to find similar documents and, optionally, which boost to give to the portion of the similarity score generated by the match of a given metadata.
- Deployment:
  - Timeout: execution timeout expressed in minutes (m) or seconds (s).
  - Replicas: number of required instances.
  - Memory: required memory.
  - CPU: thousandths of a CPU required (for example: 1000 = 1 CPU).
  - Consumer Number: number of threads of the consumer, the software module of the block that provides input to process by taking it from the block's work queue.
- Input: input properties correspond to the input variables of the component (see above).
- Output: read-only, the output manifest of the component.
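As a quick aid for reading the Deployment units, the following Python sketch converts a timeout and a CPU value into seconds and CPU cores. The textual form of the timeout (a whole number followed by m or s, such as 10m) is an assumption, since only the units are stated above; the function names are hypothetical too.

```python
import re


def timeout_to_seconds(text: str) -> int:
    """Convert a timeout such as '10m' or '30s' to seconds.

    Assumption: the value is a whole number followed by 'm' or 's'.
    """
    match = re.fullmatch(r"\s*(\d+)\s*([ms])\s*", text)
    if match is None:
        raise ValueError(f"unsupported timeout value: {text!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return amount * 60 if unit == "m" else amount


def cpu_thousandths_to_cores(thousandths: int) -> float:
    """Convert thousandths of a CPU to CPU cores (1000 = 1 CPU)."""
    return thousandths / 1000


print(timeout_to_seconds("10m"))
print(cpu_thousandths_to_cores(250))
```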
Output
The output of a Similarity Calculator block is a JSON object with the following structure:
{
  "documents": [
    {
      "fields": {},
      "id": "abc",
      "score": 123,
      "scoredFields": {}
    },
    ...
  ],
  "maxScore": 123
}
where:
- documents (array): documents' data, one item for each returned document. The returned documents are those similar to the document with ID equal to the input variable documentId if the input variable mode is SIMILARITY, the indexed documents matching the input variable esearch if mode is SEARCH, or the document with ID equal to the input variable documentId if mode is RETRIEVE.
  Each item has these properties:
  - fields (object): document metadata corresponding to the outputFields input variable. The object has one property for each metadata listed in outputFields that doesn't have a score. The property is an array whose items are the values of the metadata.
  - id (string): document ID.
  - score (number): document score. It is the similarity score if the input variable mode is SIMILARITY, the constant value 1 if mode is SEARCH and the constant value 0 if mode is RETRIEVE.
  - scoredFields (object): like fields, but only for metadata listed in outputFields that have a score. This kind of score is not computed by Elasticsearch: it is an attribute of the value of the metadata that gets indexed together with the value. The value of scored metadata has this format:
      value|score
    where value is the value of the metadata and score is the score of the value. Examples of scored fields are main lemmas and extractions.
- maxScore (number): highest value of score in the items of documents.
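A consumer of this output might want to split scored values into their two parts. The following Python sketch shows one way to do that, assuming that each property of scoredFields is an array of value|score strings (by analogy with fields) and that the score part is numeric. The helper names, the mainLemmas field name, and all sample values are made up for illustration.

```python
from typing import Any, Dict, List, Tuple


def parse_scored_value(raw: str) -> Tuple[str, float]:
    """Split 'value|score' into (value, score).

    Splits on the last '|' so that a value containing '|' stays intact.
    """
    value, separator, score = raw.rpartition("|")
    if not separator:
        raise ValueError(f"not in value|score format: {raw!r}")
    return value, float(score)


def parse_documents(output: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the documents with scoredFields decoded into (value, score) pairs."""
    rows = []
    for doc in output.get("documents", []):
        scored = {
            name: [parse_scored_value(item) for item in items]
            for name, items in doc.get("scoredFields", {}).items()
        }
        rows.append({
            "id": doc["id"],
            "score": doc["score"],
            "fields": doc.get("fields", {}),
            "scoredFields": scored,
        })
    return rows


# Made-up output following the structure described above.
sample_output = {
    "documents": [
        {
            "fields": {"author": ["Clive Staples Lewis"]},
            "id": "abc",
            "score": 12.5,
            "scoredFields": {"mainLemmas": ["wardrobe|8.2", "lion|6.0"]},
        },
        {"fields": {}, "id": "def", "score": 7.0, "scoredFields": {}},
    ],
    "maxScore": 12.5,
}

rows = parse_documents(sample_output)
print(rows[0]["scoredFields"])
```

Splitting on the last separator is a defensive choice: the format description does not say whether a metadata value can itself contain the | character.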