Skip to content

Language Detector processor

Description

The Language Detector processor predicts the languages in which a text is written. It can accept fragmented text from the Text Fragmenter processor, and, in that case, it make predictions about the languages for each fragment too.

Old version

There are two versions of the component available: the latest and 1.0.0. This version is present for backward compatibility with old workflows created with previous versions of NL Flow. For new workflows always use the latest version.

The blocks of version 1.0.0 are identical to those of version 1.9 of NL Flow, however something has changed in the block properties:

  • The name of the block, instead of editing the Block name field, can be modified by selecting Edit component name .
  • The Propagate input text to output block property has been removed from the Functional tab.
  • The Output tab has been added (see below).

All other block properties remain unchanged, so if you are dealing with a version 1.0.0 block refer to the NL Flow version 1.9 documentation and ignore the following here except for the Output tab.

Input

A block of this processor has these input variables:

  • text (string, required): the text to process.
  • fragments (object, optional): corresponds to the value of the fragments key in the output of a Text Fragmenter block. With this input, the processor makes predictions about the language of each fragment in addition to the predictions for the whole text.
  • options (object, optional): an object whose properties can be used to override the values of block's functional properties (see below). This is the correspondence between the properties of options and the functional properties of the block:

    Object property Corresponding functional parameter
    languages Detectable languages
    outputText Propagate input text to output
    enableOthers Enable "Other language" prediction
    maxPredictions Max number of predictions

Block properties

Block properties can be set by editing the block.
Language Detector workflow blocks have the following properties:

  • Basic properties:

    • Block name, it can be edited
    • Component version (read only)
    • Block ID (read only)
  • Functional:

    • Detectable languages

      The comma separated list of ISO-639-1 codes1 of the languages the processor can choose from when making predictions.

    • Enable "Other language" prediction: enables the prediction of a other label corresponding to languages that are not listed in Detectable languages. Default: on.

    • Max number of predictions: maximum number of predictions. Default: 10.

  • Deployment:

    • Timeout: execution timeout expressed in minutes (m) or seconds (s).
    • Replicas: number of required instances
    • Memory: required memory
    • CPU: thousandths of a CPU required (for example: 1000 = 1 CPU)
  • Input: these properties correspond to the input variables.
    Input properties are read-only—so only descriptive of the expected input—when the block is the first in a flow and the workflow's input has not been explicitly described. In that case the workflow's input JSON must contain keys whose name and type match those of the input variables.
    Otherwise, they are editable and must be set.

  • Output: read-only, this property is a navigable description of the structure of the output JSON.

Output

In case the input contains only text—no fragments—, the block output has this structure:

{
    "prediction": {}
}

If there are also fragments in the input, the output has this structure:

{
    "prediction": {},
    "fragmentsPredictions": []
}

If input variable options.outputText is set to true or is missing and functional property Propagate input text to output is turned on, the output also contains a top level key text which is the echo of input variable text, for example:

{
    "prediction": {
        "others": [
            {
            "label": "de",
            "score": 0.004342068452388048
            },
            {
            "label": "es",
            "score": 0.0034704774152487518
            },
            {
            "label": "ru",
            "score": 0.0028054893482476475
            }
        ]
        "winner": {
            "label": "en",
            "score": 0.9228296875953674
        }
    },
    "text": "How to Pick the Right Coffee Table\nWhen you shop for a coffee table you may be overwhelmed by the wealth of choices available. Coffee tables, sometimes called cocktail tables, come in many styles and materials. Whether you have a comfortable farmhouse look, breezy coastal decor or sleek contemporary furniture, you can find the perfect coffee table for your main living space. If you make the coffee table the last piece of furniture you choose for the room, it is easier to judge the right style, color, material, size and shape.\nHere are some guidelines for finding just the right coffee table to hold the remote and a drink when you settle in for a night of relaxation:\n1. Choose a Style\nRemember that as functional as a coffee table may be, it is really an example of accent furniture."
}

The prediction object has this structure:

"prediction": {
    "winner": {},
    "others": []
}

where:

  • winner (object): it corresponds to the most likely language prediction for the entire text. It has these properties:

    • label (string): ISO-639-1 code of the predicted language
    • score (decimal number between 0 and 1): confidence score of the prediction
  • others (array): it contains one item for each least likely language. Each item has the same structure as the winner object, with a label and a confidence score. In the array, the items are sorted in descending order on the value of the score property, so the labels with the highest confidence score are found first.

The total number of predictions is influenced by the values ​​of the functional properties of the block, possibly overwritten using the options input variable.
The total number of languages ​​the processor can choose from is determined by the input variable options.languages or, if missing, by the Detectable languages property, with the possible addition of the other label—corresponding to extra languages—when input variable options.enableOthers is true or, if missing, property Enable "Other language" prediction is turned on.
In any case, the total number of predictions is at most equal to the value of input variable options.maxPredictions or, if this key is missing, the value of the Max number of predictions property.

fragmentsPredictions is an array of objects, each of which contains the language predictions for one of the fragments passed in input using the fragments key. Each item has this structure:

{
    "others": [],
    "position": {},
    "winner": {}
}

where:

  • winner and others have the same structure and the same meaning—but with a scope equal to the text fragment—of the homonymous properties of the prediction object.
  • position contains the fragment position in the text and is the echo of the item in input array fragments.positions that corresponds to the fragment.

  1. There are the ISO-639-1 codes of the languages that can be detected: af, als, am, an, ar, arz, asm, ast, av, az, azb, ba, bar, bcl, be, bg, bh, bn, bo, bpy, br, bs, bxr, ca, cbk, ce, ceb, ckb, co, cs, cv, cy, da, de, diq, dsb, dty, dv, el, eml, en, eo, es, et, eu, fa, fi, fr, frr, fy, ga, gd, gl, gn, gom, gu, gv, he, hi, hif, hr, hsb, ht, hu, hy, ia, id, ie, ilo, io, is, it, ja, jbo, jv, ka, kk, km, kn, ko, krc, ku, kv, kw, ky, la, lb, lez, li, lmo, lo, lrc, lt, lv, mai, mg, mhr, min, mk, ml, mn, mr, mrj, ms, mt, mwl, my, myv, mzn, nah, nap, nds, ne, new, nl, nn, no, oc, or, os, pa, pam, pfl, pl, pms, pnb, ps, ps, pt, qu, rm, ru, rue, sa, sah, sc, scn, sco, sd, sh, si, sk, sl, so, sq, sq, sr, su, sv, sw, ta, te, tg, th, tk, tl, tr, tt, tyv, ug, uk, ur, uz, vec, vep, vi, vls, vo, wa, war, wuu, xal, xmf, yi, yo, yue, zh