Language Detector processor

Description

The Language Detector processor predicts the languages in which a text is written. It can accept fragmented text from the Text Fragmenter processor, and, in that case, it make predictions about the languages for each fragment too.

Versions

There are two versions of the component available: 1.0.0 and 1.1.0. Version 1.0.0 is present for backward compatibility with old workflows created with previous versions of NL Flow. For new workflows always use the latest version.

Input

A block of this processor has these input variables:

text (string, required): the text to process.
fragments (object, optional): corresponds to the value of the fragments key in the output of a Text Fragmenter block. With this input, the processor makes predictions about the language of each fragment in addition to the predictions for the whole text.

options (object, optional): an object whose properties can be used to override the values of block's functional properties (see below). This is the correspondence between the properties of options and the functional properties of the block:

Object property	Corresponding functional parameter
`languages`	Detectable languages
`outputText`	Propagate input text to output
`enableOthers`	Enable "Other language" prediction
`maxPredictions`	Max number of predictions

Block properties

Block properties can be set by editing the block.
Language Detector workflow blocks have the following properties:

Basic properties:
- Block name, it can be edited
- Component version (read only)
- Block ID (read only)
Functional:
- Detectable languages
  
  The comma separated list of ISO-639-1 codes¹ of the languages the processor can choose from when making predictions.
- Enable "Other language" prediction: enables the prediction of a other label corresponding to languages that are not listed in Detectable languages. Default: on.
- Max number of predictions: maximum number of predictions. Default: 10.
Deployment:
- Timeout: execution timeout expressed in minutes (m) or seconds (s).
- Replicas: number of required instances
- Memory: required memory
- CPU: thousandths of a CPU required (for example: 1000 = 1 CPU)
Input: input properties correspond to the input variables of the component (see above).
Output: read-only, the output manifest of the component.

Output

In case the input contains only text—no fragments—, the block output has this structure:

{
    "prediction": {}
}

If there are also fragments in the input, the output has this structure:

{
    "prediction": {},
    "fragmentsPredictions": []
}

If input variable options.outputText is set to true or is missing and functional property Propagate input text to output is turned on, the output also contains a top level key text which is the echo of input variable text, for example:

{
    "prediction": {
        "others": [
            {
            "label": "de",
            "score": 0.004342068452388048
            },
            {
            "label": "es",
            "score": 0.0034704774152487518
            },
            {
            "label": "ru",
            "score": 0.0028054893482476475
            }
        ]
        "winner": {
            "label": "en",
            "score": 0.9228296875953674
        }
    },
    "text": "How to Pick the Right Coffee Table\nWhen you shop for a coffee table you may be overwhelmed by the wealth of choices available. Coffee tables, sometimes called cocktail tables, come in many styles and materials. Whether you have a comfortable farmhouse look, breezy coastal decor or sleek contemporary furniture, you can find the perfect coffee table for your main living space. If you make the coffee table the last piece of furniture you choose for the room, it is easier to judge the right style, color, material, size and shape.\nHere are some guidelines for finding just the right coffee table to hold the remote and a drink when you settle in for a night of relaxation:\n1. Choose a Style\nRemember that as functional as a coffee table may be, it is really an example of accent furniture."
}

The prediction object has this structure:

"prediction": {
    "winner": {},
    "others": []
}

where:

winner (object): it corresponds to the most likely language prediction for the entire text. It has these properties:
- label (string): ISO-639-1 code of the predicted language
- score (decimal number between 0 and 1): confidence score of the prediction
others (array): it contains one item for each least likely language. Each item has the same structure as the winner object, with a label and a confidence score. In the array, the items are sorted in descending order on the value of the score property, so the labels with the highest confidence score are found first.

The total number of predictions is influenced by the values of the functional properties of the block, possibly overwritten using the options input variable.
The total number of languages the processor can choose from is determined by the input variable options.languages or, if missing, by the Detectable languages property, with the possible addition of the other label—corresponding to extra languages—when input variable options.enableOthers is true or, if missing, property Enable "Other language" prediction is turned on.
In any case, the total number of predictions is at most equal to the value of input variable options.maxPredictions or, if this key is missing, the value of the Max number of predictions property.

fragmentsPredictions is an array of objects, each of which contains the language predictions for one of the fragments passed in input using the fragments key. Each item has this structure:

{
    "others": [],
    "position": {},
    "winner": {}
}

where:

winner and others have the same structure and the same meaning—but with a scope equal to the text fragment—of the homonymous properties of the prediction object.
position contains the fragment position in the text and is the echo of the item in input array fragments.positions that corresponds to the fragment.

There are the ISO-639-1 codes of the languages that can be detected: af, als, am, an, ar, arz, asm, ast, av, az, azb, ba, bar, bcl, be, bg, bh, bn, bo, bpy, br, bs, bxr, ca, cbk, ce, ceb, ckb, co, cs, cv, cy, da, de, diq, dsb, dty, dv, el, eml, en, eo, es, et, eu, fa, fi, fr, frr, fy, ga, gd, gl, gn, gom, gu, gv, he, hi, hif, hr, hsb, ht, hu, hy, ia, id, ie, ilo, io, is, it, ja, jbo, jv, ka, kk, km, kn, ko, krc, ku, kv, kw, ky, la, lb, lez, li, lmo, lo, lrc, lt, lv, mai, mg, mhr, min, mk, ml, mn, mr, mrj, ms, mt, mwl, my, myv, mzn, nah, nap, nds, ne, new, nl, nn, no, oc, or, os, pa, pam, pfl, pl, pms, pnb, ps, ps, pt, qu, rm, ru, rue, sa, sah, sc, scn, sco, sd, sh, si, sk, sl, so, sq, sq, sr, su, sv, sw, ta, te, tg, th, tk, tl, tr, tt, tyv, ug, uk, ur, uz, vec, vep, vi, vls, vo, wa, war, wuu, xal, xmf, yi, yo, yue, zh ↩