KEYWORD attribute

The KEYWORD attribute identifies a token by specifying the exact sequence of characters that must be found in a text.

The syntax is:

KEYWORD("string1"[, "string2", ...])

where:

  • KEYWORD is the attribute name and must be written in uppercase.
  • string# is a sequence of letters, digits, spaces and punctuation marks.

If you need to use quotation marks (") inside a string, escape them with the backslash character (\). For example:

KEYWORD("\"cool\"")

matches:

That's a "cool" car

but not:

That's a cool car

Note

A single backslash (\) cannot be matched with KEYWORD, because the backslash cannot be escaped with itself in this attribute. To match it, use the PATTERN attribute instead, as in the following case:

PATTERN("\\")
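As a rough illustration of the escaping rule, the following Python sketch builds the text of a KEYWORD attribute from plain strings. The helper name keyword_attribute is hypothetical and not part of any product API; it only automates the quoting described above and rejects backslashes, which KEYWORD cannot express.

```python
def keyword_attribute(*strings: str) -> str:
    """Build the text of a KEYWORD attribute (hypothetical helper, not a product API)."""
    if not strings:
        raise ValueError("KEYWORD needs at least one string")
    quoted = []
    for s in strings:
        if "\\" in s:
            # A literal backslash cannot be escaped in KEYWORD; PATTERN is needed.
            raise ValueError("backslash is not supported in KEYWORD; use PATTERN")
        # Escape every quotation mark with a backslash, then wrap in quotes.
        quoted.append('"' + s.replace('"', '\\"') + '"')
    return "KEYWORD(" + ", ".join(quoted) + ")"


print(keyword_attribute('"cool"'))  # prints KEYWORD("\"cool\"")
print(keyword_attribute("sulphite reductor", "sulphite reductors"))
```

Generating the attribute text this way is only a convenience for long string lists; the result is still an ordinary KEYWORD attribute.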

If string# is written entirely in lowercase, the match is case-insensitive, so:

KEYWORD("triumph")

matches:

triumph
Triumph
TRIUMPH
triUMPH
...

If string# contains at least one uppercase character, the match is case-sensitive. For example:

KEYWORD("Triumph")

matches only Triumph. As a result, the following:

KEYWORD("Triumph", "triumph")

is considered an error, since string1 and string2 are treated as equivalent: the lowercase string "triumph" already matches Triumph.
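The case rule can be modeled in a few lines of Python. This is only a sketch of the behavior described above, not the engine's implementation; keyword_matches is a hypothetical name, and the function compares one string against one token.

```python
def keyword_matches(keyword: str, token: str) -> bool:
    """Model of the case rule: all-lowercase keywords ignore case, others do not."""
    if any(ch.isupper() for ch in keyword):
        # At least one uppercase character: exact, case-sensitive comparison.
        return keyword == token
    # No uppercase characters: compare case-insensitively.
    return keyword.lower() == token.lower()


for token in ("triumph", "Triumph", "TRIUMPH"):
    print(token, keyword_matches("triumph", token), keyword_matches("Triumph", token))
```

Under this model, "triumph" accepts all three tokens while "Triumph" accepts only Triumph, which shows why listing both strings in one attribute is redundant.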

The KEYWORD attribute is typically used in the following cases:

  • To identify a generic string, regardless of its possible meanings and uses.

    KEYWORD("card")
    

    In this case, every token that matches the string is identified: each occurrence of the word card in a document is matched, whether it stands alone or is part of a longer expression such as credit card, card game or discount card. A word such as postcard, on the other hand, is not matched, because KEYWORD matches whole words only, not substrings (use PATTERN for that).

  • To identify a proper noun or a collocation that does not exist in the Knowledge Graph. For example:

    KEYWORD("John Smith")
    KEYWORD("sulphite reductor", "sulphite reductors")
    

    The first example requires John Smith to be found in a document; John Smiths, which is a different name, is not matched. Because the string contains uppercase letters, the match is case-sensitive, which avoids false matches on variants such as john smith, JOHN SMITH or John SMITH. The second example specifies two strings, the singular and plural forms of a collocation: since strings are treated as exact sequences of characters, every variation or inflected form must be listed explicitly to be recognized in a document.

  • To identify a particular phraseology, no matter how complex. For example:

    KEYWORD("sulphured hydrogen reduction through Idemitsu process")
    

    The string above only matches an identical sequence in a document; slightly different versions are not considered equivalent. For instance, sulphured hydrogen reduced through Idemitsu process and sulphured hydrogen reduction through Idemitsu processes do not match, because reduced and processes differ from reduction and process in the original string.
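The whole-word, exact-sequence behavior in the cases above can be approximated with a regular expression. The Python sketch below is an assumption-laden model, not how the engine works: find_keyword is a hypothetical name, and real tokenization into atoms may differ from the simple word-boundary check used here.

```python
import re


def find_keyword(keyword: str, text: str) -> bool:
    """Return True if `keyword` occurs in `text` as a whole-word sequence."""
    # Strings without uppercase characters match case-insensitively.
    flags = 0 if any(ch.isupper() for ch in keyword) else re.IGNORECASE
    # The lookarounds stop a match from starting or ending inside a longer word.
    pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
    return re.search(pattern, text, flags) is not None


print(find_keyword("card", "I paid by credit card"))       # prints True
print(find_keyword("card", "She sent a postcard"))         # prints False
print(find_keyword("John Smith", "John Smiths called"))    # prints False
print(find_keyword("John Smith", "a note for john smith")) # prints False
```

The lookarounds are what exclude postcard and John Smiths; without them, a plain substring search would accept both.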

Warning

Some words consist of a single string that the engine splits into multiple atoms.

When the engine matches the atoms of such a string with the KEYWORD attribute, each match always runs from the beginning of an atom to the end of the whole string.

For instance, the string cannot generates two keywords: cannot and not.

So while you can access the KEYWORD not after the LEMMA can, you cannot access the KEYWORD can before not.