Similarity Document Preparator

Description

A Similarity Document Preparator block pre-processes the output of a predictive model, plus optional additional metadata, to produce an output compatible with the input of a Similarity Indexer block: a list of Elasticsearch fields corresponding to the input metadata, each with its values.
The block uses the lists of metadata and, possibly, sections, contained in a similarity model to determine which metadata to pre-process and how.

Input

A Similarity Document Preparator block has these input variables:

customMetadata (object): optional additional metadata to pre-process.
modelAnalyzedDocument (object): analysis output. Map it to the document key of the output of the upstream predictive model block.
modelOutput (string): reserved for system workflows, must be omitted.
modelOutputCompressionAlgorithm (string): reserved for system workflows, must be omitted.
spk (object): override of the SPK functional property. Reserved for system workflows, must be omitted.

Block properties

Block properties can be set by editing the block.
Similarity Document Preparator workflow blocks have the following properties:

Basic properties:
- Block name (editable)
- Component version (read only)
- Block ID (read only)
Functional:
- SPK: similarity model. The block uses it to determine which input metadata, taken from the document input variable, must be pre-processed and how. The block doesn't use the similarity model to pre-process the custom metadata passed in the customMetadata input variable, if any: custom metadata is simply "echoed" in the block's [output](#output).
Deployment:
- Timeout: execution timeout expressed in minutes (m) or seconds (s).
- Replicas: number of required instances.
- Memory: required memory.
- CPU: thousandths of a CPU required (for example: 1000 = 1 CPU).
- Consumer Number: number of threads of the consumer, the software module of the block that takes the input to process from the block's work queue.
Input: input properties correspond to the input variables of the component (see above).
Output: read-only, the output manifest of the component.

Output

The output of a Similarity Document Preparator block is a JSON object compatible with the input of a Similarity Indexer block.
For example:

{
    "outputFields": {
        "scored_std_tokens_MLE_BODY": [
            "west|50.29",
            "curve|46.09"
        ],
        "std_tokens_MLE_BODY": [
            "west",
            "curve"
        ],
        "scored_std_THESAURUS_BODY": [
            "The Moon|100.0",
            "Milky Way Galaxy|0.0",
            "Cosmology|0.0",
            "Natural satellites|50.0"
        ],
        "scored_std_tokens_TPC": [
            "astronomy|4.3"
        ],
        "scored_std_tokens_MPH_BODY": [
            "be at west|100.0"
        ],
        "std_tokens_MPH_BODY": [
            "be at west"
        ],
        "std_tokens_TPC": [
            "astronomy"
        ],
        "std_THESAURUS_BODY": [
            "The Moon",
            "Milky Way Galaxy",
            "Cosmology",
            "Natural satellites"
        ]
    }
}

The output has the following structure:

{
    "outputFields": {}
}

where the outputFields object contains a property for each metadata to index.
Every property is an array. The property name is the name of the Elasticsearch field corresponding to the metadata, while the items of the array are the values of the metadata to index.
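As a rough illustration of this structure, the following Python sketch checks that a payload has the shape described above before passing it on. The function name check_output_fields and the sample payload are hypothetical, and the check assumes that every array item is a string, as in the example above; it is not part of the product.

```python
import json


def check_output_fields(payload: dict) -> dict:
    """Return the outputFields mapping after a basic shape check.

    Hypothetical helper: assumes every property is an array of strings,
    as in the documented example.
    """
    fields = payload.get("outputFields")
    if not isinstance(fields, dict):
        raise ValueError("missing or invalid 'outputFields' object")
    for name, values in fields.items():
        # Each property name is an Elasticsearch field; its items are the values to index.
        if not isinstance(values, list) or not all(isinstance(v, str) for v in values):
            raise ValueError(f"field {name!r} is not an array of strings")
    return fields


# Illustrative payload reusing one field from the example above.
sample = json.loads('{"outputFields": {"std_tokens_TPC": ["astronomy"]}}')
for field_name, values in check_output_fields(sample).items():
    print(field_name, values)
```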
If the input metadata has a score (this happens with tokens like the main lemmas or with extractions), the format of the value is:

value|score

where value is the value of the metadata and score is the score of the value.
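A consumer of this output may need to separate the value from its score. The Python sketch below shows one possible way to do it. The names ScoredValue and parse_scored_value are hypothetical, and splitting on the last "|" is an assumption, since the format description does not say whether a value can itself contain that character.

```python
from typing import NamedTuple


class ScoredValue(NamedTuple):
    value: str
    score: float


def parse_scored_value(item: str) -> ScoredValue:
    """Split a "value|score" string into its value and numeric score."""
    # Assumption: split on the LAST "|" so that a value containing "|" stays intact.
    value, sep, score = item.rpartition("|")
    if not sep:
        raise ValueError(f"no score separator in {item!r}")
    return ScoredValue(value, float(score))


# Items taken from the example output above.
parsed = [parse_scored_value(s) for s in ["The Moon|100.0", "Natural satellites|50.0"]]
for entry in parsed:
    print(entry.value, entry.score)
```

Items of unscored fields, such as "astronomy" in std_tokens_TPC, contain no separator, so this sketch rejects them with a ValueError instead of guessing a score.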