Language Detector processor

Description

The Language Detector processor predicts the languages in which a text is written. It can also accept fragmented text from the Text Fragmenter processor, in which case it makes predictions about the language of each fragment as well.

Input

The processor requires the input JSON to contain at least this top level key:

"text": "text"

where text is the text to process.
Optional top level keys are:

"fragments": fragments

and:

"options": options

fragments is an object and corresponds to the value of the fragments key in the output of a Text Fragmenter block. With this input, the processor makes predictions about the language of each fragment in addition to the predictions for the whole text.

options is an object. Its properties can be used to override the values of the block's functional properties (see below). This is the correspondence between the properties of options and the functional properties of the block:

Object property	Corresponding functional property
`languages`	Detectable languages
`outputText`	Propagate input text to output
`enableOthers`	Enable "Other language" prediction
`maxPredictions`	Max number of predictions
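
As an illustration, the input described above could be assembled as in the following sketch. The helper name build_input is hypothetical, and the value types used for the options (a comma-separated string for languages, booleans, an integer) are assumptions inferred from the descriptions of the corresponding functional properties, not a confirmed schema.

```python
import json


def build_input(text, fragments=None, options=None):
    """Assemble an input JSON object for the block (hypothetical helper)."""
    payload = {"text": text}  # the only required top-level key
    if fragments is not None:
        # Value of the "fragments" key taken from a Text Fragmenter output.
        payload["fragments"] = fragments
    if options:
        # Overrides for the block's functional properties.
        payload["options"] = options
    return payload


payload = build_input(
    "Where is the nearest train station?",
    options={
        "languages": "en,fr,de",  # assumed format: comma-separated codes
        "outputText": True,
        "enableOthers": False,
        "maxPredictions": 3,
    },
)
print(json.dumps(payload, indent=4))
```

Keys that are not supplied are simply left out, so the block's own functional properties apply to them.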

Block properties

Block properties can be set by editing the block.
Language Detector workflow blocks have the following properties:

Common:
- The unique block ID and the service version, displayed in the title bar (read only, displayed also in the block tooltip in the canvas).
- Block name: the name of the block (editable).
- Description: the description of the processor (read only).
Type Specific:
- Timeout: execution timeout expressed in minutes (m) or seconds (s).
Functional:
- Detectable languages
  
  The comma-separated list of ISO-639-1 codes¹ of the languages the processor can choose from when making predictions.
- Propagate input text to output: when turned on, the input key text is echoed in the output JSON. Default: off.
- Enable "Other language" prediction: enables the prediction of an other label corresponding to languages that are not listed in Detectable languages. Default: on.
- Max number of predictions: maximum number of predictions. Default: 10.
Deployment:
- Replicas: number of required instances
- Memory: required memory
- CPU: thousandths of a CPU required (for example: 1000 = 1 CPU)
Input:

Used for input mapping: one property for each of the top level keys of the input JSON.
These properties do not need to be set if either of the following is true:
- The block is the first in a flow and the workflow input contains only the expected keys.
- The previous block's output contains only the expected keys.
Otherwise, the properties determine which top-level keys of the overall "upstream JSON" must be mapped to the block's input keys. The value of each property must be chosen from the compatible keys of the upstream blocks' output or, if the input format of the workflow has been defined, from the keys of the $nlflow_input pseudo block.

Output

If the input contains only text, with no fragments, the block output has this structure:

{
    "prediction": {}
}

If there are also fragments in the input, the output has this structure:

{
    "fragmentsPredictions": [],
    "prediction": {}
}

The output also contains a top-level key text, which echoes the input key text, when the input key options.outputText is set to true or, if that key is missing, when the functional property Propagate input text to output is turned on. For example:

{
    "prediction": {
        "others": [
            {
                "label": "de",
                "score": 0.004342068452388048
            },
            {
                "label": "es",
                "score": 0.0034704774152487518
            },
            {
                "label": "ru",
                "score": 0.0028054893482476475
            }
        ],
        "winner": {
            "label": "en",
            "score": 0.9228296875953674
        }
    },
    "text": "How to Pick the Right Coffee Table\nWhen you shop for a coffee table you may be overwhelmed by the wealth of choices available. Coffee tables, sometimes called cocktail tables, come in many styles and materials. Whether you have a comfortable farmhouse look, breezy coastal decor or sleek contemporary furniture, you can find the perfect coffee table for your main living space. If you make the coffee table the last piece of furniture you choose for the room, it is easier to judge the right style, color, material, size and shape.\nHere are some guidelines for finding just the right coffee table to hold the remote and a drink when you settle in for a night of relaxation:\n1. Choose a Style\nRemember that as functional as a coffee table may be, it is really an example of accent furniture."
}

The prediction object has this structure:

"prediction": {
    "others": [],
    "winner": {}
}

winner is an object corresponding to the most likely language prediction for the entire text. It has these properties:

label (string): ISO-639-1 code of the predicted language
score (decimal number between 0 and 1): confidence score of the prediction

others is an array with one item for each of the less likely languages.
Each item has the same structure as the winner object, with a label and a confidence score. In the array, the items are sorted in descending order on the value of the score property, so the labels with the highest confidence score are found first.
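
A consumer of the output might read the prediction object as in the following sketch. The function name ranked_predictions is hypothetical; the sample values are taken from the example output above.

```python
def ranked_predictions(prediction):
    """Return (label, score) pairs: the winner first, then the others in array order."""
    winner = prediction["winner"]
    ranked = [(winner["label"], winner["score"])]
    # "others" is already sorted by descending score, so its order is kept.
    ranked.extend(
        (item["label"], item["score"]) for item in prediction.get("others", [])
    )
    return ranked


prediction = {
    "others": [
        {"label": "de", "score": 0.004342068452388048},
        {"label": "es", "score": 0.0034704774152487518},
        {"label": "ru", "score": 0.0028054893482476475},
    ],
    "winner": {"label": "en", "score": 0.9228296875953674},
}

ranked = ranked_predictions(prediction)
for label, score in ranked:
    print(f"{label}: {score:.4f}")
```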

The total number of predictions is influenced by the values of the functional properties of the block, possibly overridden using the options input key.
The set of languages the processor can choose from is determined by the input key options.languages or, if that key is missing, by the Detectable languages property. The other label, which corresponds to languages outside that set, is added when the input key options.enableOthers is true or, if that key is missing, when the Enable "Other language" prediction property is turned on.
In any case, the total number of predictions is at most equal to the value of the input key options.maxPredictions or, if that key is missing, the value of the Max number of predictions property.
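
The precedence rules above can be sketched as follows. This is only an illustration of the documented behavior, not the processor's implementation: the function names are hypothetical, the languages value is assumed to be a comma-separated string, the example language list is arbitrary, and the computed figure is an upper bound implied by the text (the actual number of predictions may be lower).

```python
def effective_settings(block_properties, options=None):
    """Apply the options input key on top of the block's functional properties."""
    settings = dict(block_properties)
    for key, value in (options or {}).items():
        if key in settings:  # only known properties can be overridden
            settings[key] = value
    return settings


def prediction_upper_bound(settings):
    """Upper bound on the total number of predictions implied by the settings."""
    # Assumption: "languages" is a comma-separated string of codes.
    candidates = len([c for c in settings["languages"].split(",") if c.strip()])
    if settings["enableOthers"]:
        candidates += 1  # the extra "other" label
    return min(candidates, settings["maxPredictions"])


# Documented defaults, plus an arbitrary example language list.
block_properties = {
    "languages": "en,fr,de",
    "outputText": False,
    "enableOthers": True,
    "maxPredictions": 10,
}

defaults = effective_settings(block_properties)
overridden = effective_settings(
    block_properties, {"maxPredictions": 2, "enableOthers": False}
)
print(prediction_upper_bound(defaults), prediction_upper_bound(overridden))
```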

fragmentsPredictions is an array of objects, each of which contains the language predictions for one of the fragments passed in input using the fragments key. Each item has this structure:

{
    "others": [],
    "position": {},
    "winner": {}
}

where winner and others have the same structure and meaning as the properties of the same name in the prediction object, but are scoped to the text fragment. position contains the position of the fragment in the text and echoes the item of the input array fragments.positions that corresponds to the fragment.
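
As a sketch of how the per-fragment results might be consumed, the following function pairs each fragment position with its winning label. The function name fragment_winners is hypothetical, and the contents of the position objects in the sample data (start, end) are placeholders: the actual structure is whatever the Text Fragmenter block produces.

```python
def fragment_winners(output):
    """Return (position, label, score) for the winner of each fragment."""
    return [
        (item["position"], item["winner"]["label"], item["winner"]["score"])
        for item in output.get("fragmentsPredictions", [])
    ]


# Sample output; the "position" contents are placeholders, not a documented schema.
output = {
    "fragmentsPredictions": [
        {
            "others": [{"label": "es", "score": 0.02}],
            "position": {"start": 0, "end": 18},
            "winner": {"label": "en", "score": 0.95},
        },
        {
            "others": [{"label": "en", "score": 0.03}],
            "position": {"start": 19, "end": 40},
            "winner": {"label": "fr", "score": 0.91},
        },
    ],
    "prediction": {"others": [], "winner": {"label": "en", "score": 0.6}},
}

winners = fragment_winners(output)
for position, label, score in winners:
    print(position, label, score)
```

When the input contained no fragments, the fragmentsPredictions key is absent and the function returns an empty list.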

¹ These are the ISO-639-1 codes of the languages that can be detected: af, als, am, an, ar, arz, asm, ast, av, az, azb, ba, bar, bcl, be, bg, bh, bn, bo, bpy, br, bs, bxr, ca, cbk, ce, ceb, ckb, co, cs, cv, cy, da, de, diq, dsb, dty, dv, el, eml, en, eo, es, et, eu, fa, fi, fr, frr, fy, ga, gd, gl, gn, gom, gu, gv, he, hi, hif, hr, hsb, ht, hu, hy, ia, id, ie, ilo, io, is, it, ja, jbo, jv, ka, kk, km, kn, ko, krc, ku, kv, kw, ky, la, lb, lez, li, lmo, lo, lrc, lt, lv, mai, mg, mhr, min, mk, ml, mn, mr, mrj, ms, mt, mwl, my, myv, mzn, nah, nap, nds, ne, new, nl, nn, no, oc, or, os, pa, pam, pfl, pl, pms, pnb, ps, pt, qu, rm, ru, rue, sa, sah, sc, scn, sco, sd, sh, si, sk, sl, so, sq, sr, su, sv, sw, ta, te, tg, th, tk, tl, tr, tt, tyv, ug, uk, ur, uz, vec, vep, vi, vls, vo, wa, war, wuu, xal, xmf, yi, yo, yue, zh