Tagging
Introduction
Tagging is a way to programmatically add new attributes, called tag instances, to text tokens, in addition to the attributes found by automatic text analysis. These new attributes can then be matched in the conditions of rules using the TAG and the BTAG attributes.
Tag instances are also manageable by JavaScript, which gives maximum flexibility in their use.
One way to add tag instances to text tokens is to use tagging rules. In the document analysis pipeline, tagging rules are evaluated after text analysis and before other rules.
This way, the conditions of tagging rules can leverage all the token attributes discovered by text analysis.
Every tag needs to be declared. The declaration establishes the properties that are common to all the instances of the tag:
- Name
- (Optional) Syncon ID: the identifier of a syncon of the Knowledge Graph
If a syncon ID is part of the tag's declaration, adding an instance of the tag to a token actually alters the token's meaning, something to consider when using the SYNCON and the ANCESTOR attributes in rules' conditions.
Each added tag can span multiple tokens, even an entire sentence, and has a dynamic property, called tag entry, which is the textual value of the instance of the tag. It can be the literal text of the tokens covered by the tag or a more complex function of it. Instances of the same tag can be added to multiple text tokens, even non-consecutive ones, and instances of different tags can be added to the same token.
Info
When one instance of a tag spans multiple text tokens, it's as if there were a separate, identical instance of the tag above each token, but there is actually only one "long" instance of the tag.
Each instance of a tag sits on a level. Levels can be thought of as layers above the text tokens.
Every tagging rule sets the level of the tag instances it adds; the default level is 10000.
All rules of the same level are evaluated together by the analysis engine, which proceeds in ascending order from the numerically lowest level to determine the rules to evaluate next.
This implies that a tagging rule can match in its condition, with the TAG or BTAG attributes, any tag instance added by lower-level tagging rules, making it possible to add new tag instances based on the pre-existence of other tag instances.
With JavaScript it's possible to add, disable (untag) and rename tag instances.
Tagging is mainly used to write compact, readable rules with less repetition. In fact, if you add instances of a tag to all the parts of the text that have certain characteristics, to refer to those parts of the text you just need to refer to the tag, without having to repeat the list of characteristics every time.
For example, if you have to write multiple categorization rules with conditions that imply the presence of a person's name or address, you can first add an instance of the PERSONAL_DATA tag to all the relevant tokens, then write categorization rules with conditions that use the TAG attribute to match the PERSONAL_DATA tag.
In fact, a tag can be considered as a personalized marker of each piece of text that has certain characteristics, for the most varied uses.
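The PERSONAL_DATA scenario could be sketched as follows, using the declaration and rule syntax described later in this page. This is only an illustrative sketch: it assumes that TYPE(NPH) and TYPE(ADR) match people's names and addresses in your project, and privacy is a hypothetical category name that does not come from this page.

```
TAGS
{
    @PERSONAL_DATA
}

SCOPE SENTENCE
{
    // Assumption: NPH and ADR are the entity types for people and addresses.
    TAGGER()
    {
        @PERSONAL_DATA[TYPE(NPH)]
    }

    TAGGER()
    {
        @PERSONAL_DATA[TYPE(ADR)]
    }
}

SCOPE SENTENCE
{
    // "privacy" is a hypothetical category used only for illustration.
    DOMAIN(privacy)
    {
        TAG(PERSONAL_DATA)
    }
}
```

With this arrangement, the characteristics that identify personal data are listed once, in the tagging rules, and every other rule only needs to refer to the tag.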
Declaration
All tags must be declared. You are free to put the declaration in any rules file, but the suggested place is the config.cr file.
The declaration of one or more tags has this syntax:
TAGS
{
declaration(s)
}
For example:
TAGS
{
@CODE,
@MEDICINE:100012140
}
Each declaration establishes the common properties of each tag instance and has the following syntax:
@tagName[:synconID]
where tagName is the name of the tag. synconID is optional and is the identifier of a Knowledge Graph syncon.
When an instance of a tag with synconID is added to a token already having its own syncon ID, the tag's ID replaces the pre-existing ID.
Multiple declarations must be separated with a comma.
In the example above you find the declaration of two tags, CODE and MEDICINE. For the second tag, a syncon ID (100012140, drug) is declared too. This implies that every time an instance of the tag is added to a text token, the syncon ID attribute of that token will become that of the tag.
Tagging rules
Tag instances can be added with tagging rules. Tagging rules are evaluated after text analysis and before rules of other types.
Like any other rule, a tagging rule's action is triggered if some text in the rule's scope is matched by the rule's condition.
For any operand of the condition starting with the name of a tag preceded by the "at" sign (@), the tag is added to the text matched by the operand. This is similar to extraction rules, where the action part of the rule is included in its condition, but in that case the text matched by sub-conditions starting with the name of a field preceded by the "at" sign becomes the value of the extracted field.
Here is what a tagging rule with its scope specification looks like:
SCOPE SENTENCE
{
TAGGER()
{
@CODE[PATTERN("\d{8}")]
}
}
The rule adds an instance of tag CODE to every token whose text is a sequence of eight digits.
The syntax is:
TAGGER([tagLevel])
{
condition
}
where tagLevel is the level on which added tags will sit. It is a positive integer. If it is omitted, the default value of 10000 is used.
Sub-conditions that determine the action of the rule have this syntax:
@tagName[operand]
where tagName is the name of the tag. When the rule is triggered by some text, an instance of tag tagName is added to all the tokens matched by operand.
The level of the rule determines the order in which the rule is evaluated and the level on which added tag instances will sit: the higher the value, the later the rule will be evaluated. For instance, if you have a rule with TAGGER(10) and another rule with TAGGER(2), the latter will be evaluated first.
This means that tagging rules can refer, in their condition, to tags added by lower-level rules.
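As a sketch of how levels let one rule build on another, the rules below first tag eight-digit codes and then, on a higher level, tag the label that accompanies them. The CODE_LABEL tag and the level numbers 10 and 20 are hypothetical choices made for illustration; the TAG attribute and the AND operator are used in the same way as in the other examples on this page.

```
TAGS
{
    @CODE,
    @CODE_LABEL
}

SCOPE SENTENCE
{
    // Lower level: evaluated first.
    TAGGER(10)
    {
        @CODE[PATTERN("\d{8}")]
    }

    // Higher level: evaluated later, so its condition can match
    // the CODE instances already added on level 10.
    TAGGER(20)
    {
        @CODE_LABEL[KEYWORD("HS Code")]
        AND
        TAG(CODE)
    }
}
```

If the level numbers were swapped, the rule matching TAG(CODE) would be evaluated before any CODE instance exists, so it would be expected not to trigger.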
As for other types of rules, tagging rules must be contained in a scope specification.
Tag-prefixed operands can be combined with other simple or tag-prefixed operands to create complex conditions like:
operand
operator
@tagX[operand]
operator
@tagY[operand]
operator
operand
The same tag can be referenced multiple times in the same rule, for example:
SCOPE SENTENCE
{
TAGGER()
{
@DATE[KEYWORD("tomorrow")]
<>
@DATE[TYPE(DAT)]
}
}
The rule adds an instance of tag DATE to the token with literal text tomorrow and another instance to the text corresponding to a date. So, in the case of this text:
He said: "It's not mandatory that you return the book by tomorrow, but you have to return it before September 1st".
the first instance of the tag will be added to token tomorrow and the second instance will span the tokens September 1st.
Adding tag instances to atoms
When a tag-prefixed operand matches an atom, the tag instance is added to the atom and not to the entire token.
For example, in this sentence:
The dog bite on Lucy's hand is still visible.
the collocation dog bite is a token containing atoms dog and bite.
In that case, tagging rule:
SCOPE SENTENCE
{
TAGGER()
{
@PET[KEYWORD("dog")]
}
}
will add an instance of tag PET to atom dog inside dog bite.
This happens when you use these attributes:
in a tag-prefixed operand.
If you add the TOKEN transformer to the operand, the tag instance spans the entire token containing the atom.
So, by changing the sample rule this way:
SCOPE SENTENCE
{
TAGGER()
{
@PET[KEYWORD("dog")]|[TOKEN]
}
}
the instance of PET is added to dog bite.
Automatic merging
Overlapping instances of the same tag are automatically merged. The resulting instance sits on the same level as the highest-level source instance.
For example, consider this tag definition:
TAGS
{
@THUMBUP
}
and the following tagging rules:
TAGGER(20)
{
@THUMBUP[KEYWORD("very good")]
}
...
TAGGER(5)
{
@THUMBUP[KEYWORD("good")]
}
...
TAGGER(10)
{
@THUMBUP[KEYWORD("good time")]
}
If the rules are applied to this text:
I went to Miami and I had a very good time.
level 5 rule is evaluated first and it adds a level 5 instance of tag THUMBUP to good.
Then, level 10 rule is evaluated. Since it would add a level 10 instance of THUMBUP to good time and that instance would completely cover the previous instance, only the last instance is kept.
Finally, level 20 rule is evaluated. It would add an instance of THUMBUP covering very good, which partially overlaps with the level 10 instance. The two instances are merged and only one level 20 instance covering very good time is kept.
How to match tag instances
Tag instances can be matched in the conditions of rules with the TAG and the BTAG attributes.
For example, given this declaration:
TAGS
{
@DRUG_CODE,
@DRUG:100008386
}
and these tagging rules:
SCOPE SENTENCE
{
TAGGER()
{
@DRUG_CODE[PATTERN("\d{8}")]
}
TAGGER()
{
@DRUG[KEYWORD("Acetaminophen", "Ibuprofen")]
}
}
with this text:
The use of Ibuprofen (HS Code 30049063) is not recommended in patients with advanced renal disease.
Acetaminophen (HS Code 30049029) is suggested for occasional use in patients with kidney disease.
an instance of tag DRUG_CODE is added to tokens 30049063 and 30049029 and an instance of tag DRUG is added to tokens Acetaminophen and Ibuprofen.
Also, the syncon ID of Acetaminophen and Ibuprofen is set to 100008386 (drug, medicine, medicinal drug).
Given the tagging above, this very simple extraction rule:
SCOPE SENTENCE
{
IDENTIFY(DRUG)
{
@NAME[TAG(DRUG)]
AND
@HS_CODE[TAG(DRUG_CODE)]
}
}
generates these records with the DRUG template:
NAME | HS_CODE |
---|---|
acetaminophen | 30049029 |
ibuprofen | 30049063 |
Also, this categorization rule:
SCOPE SENTENCE
{
DOMAIN(medicines)
{
SYNCON(100008386)
}
}
gives 20 points to category medicines, 10 for Acetaminophen and 10 for Ibuprofen. This happens because, after the addition of an instance of the DRUG tag, both tokens have syncon ID 100008386 instead of the original IDs found by text analysis.
Tagging partial or whole textual sequences
Tagging occurs on textual elements defined in the tagging rule through the tag definition.
You can use tags for:
- Single textual elements.
- Whole textual sequences.
- Partial textual sequences.
For single textual elements, see the example above.
For whole textual sequences, you can use the SEQUENCE transformer, while for partial textual sequences, use composition instead.
Consider these tags and the tagging rule:
TAGS
{
@TAG1,
@TAG2,
@TAG3
}
SCOPE SENTENCE
{
TAGGER()
{
TYPE(ART)
<1:2>
@TAG1[LEMMA("product")]|[SEQUENCE]
<1:2>
LEMMA("developer")
}
}
If you apply this tagging rule to this input text:
Mark and John are the product developers.
you will get this tagging record:
Tag level: 10000
Tag name | Value |
---|---|
TAG1 | the product developers |
As you can see, with the SEQUENCE transformer the tag spans the whole textual sequence defined in your tagging rule instead of being limited to the lemma product.
To tag a partial and more precise textual sequence of your choice, you can use composition.
Note
Using composition will tag all the tokens between the tokens affected by composition.
If you have the same tags defined above but these rules with composition:
SCOPE SENTENCE
{
TAGGER()
{
TYPE(ART)
>>
@TAG1[LEMMA("product")]|[#1]
<1:11>
@TAG1[LEMMA("developer")]|[#2]
}
}
SCOPE SENTENCE
{
IDENTIFY(TEST)
{
@FIELD1[TAG(TAG1)]
}
}
applied to this text:
The product of the company is managed by a team of seven developers.
you will get this tagging record:
Tag level: 10000
Tag name | Value |
---|---|
TAG1 | product developers |
and what follows in the Semantic Analysis tool window:
By adding the tag to the lemma developer plus composition beside each lemma, you obtained product developers as output, without the definite article the. Similar to SEQUENCE, the tag spans the whole sequence of tokens in the disambiguator, from product to developers, and is entirely grouped as output in the Extraction tool window.
The chunks in a tagging rule are used as begin and end of the sequence to be considered and the elements in the middle are also tagged.
The whole sequence of tokens is tagged; the value of the TAG1 tag is product developers, corresponding to its TagEntry, namely the tokens tagged in the tagging rule.
Note
More information about TagEntry below.
If you invert the order of the components, like this (for demonstrative purposes):
SCOPE SENTENCE
{
TAGGER()
{
TYPE(ART)
>>
@TAG1[LEMMA("product")]|[#2]
<1:11>
@TAG1[LEMMA("developer")]|[#1]
}
}
SCOPE SENTENCE
{
IDENTIFY(TEST)
{
@FIELD1[TAG(TAG1)]
}
}
applied to the same text above, you will get developers product as tagging output, but nothing as extraction output. Further, the whole sequence of tokens is not tagged in the disambiguator.
Tag instance value
A dynamic property of each tag instance is its value, also called tag entry.
The value of a tag instance depends on how the instance is generated. If it is generated by a tagging rule, the tag entry is the conventional value of the matched attribute.
For example, when this tag declaration and this tagging rule:
TAGS
{
@DOG_BREED
}
SCOPE SENTENCE
{
TAGGER()
{
@DOG_BREED[LEMMA("doberman")]
}
}
are applied to this text:
Dobermans are beautiful.
an instance of tag DOG_BREED is added to Dobermans and its value is Doberman, that is, the value of the LEMMA attribute for that text token.
This base value can be optionally altered using value transformers.
If the instance is generated by the script with a method of the pre-defined DIS object, the value varies with the method.
The tag entry can be assigned to fields in extraction rules using the TAGENTRY value transformer.
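For instance, an extraction rule could pass the tag entry of the DOG_BREED instances to a field along these lines. This is a hedged sketch: the PETS template and the BREED field are hypothetical names, and the transformer is written with the same |[...] notation used for the other transformers on this page, which is an assumption about its exact spelling.

```
SCOPE SENTENCE
{
    // PETS and BREED are hypothetical template and field names.
    IDENTIFY(PETS)
    {
        @BREED[TAG(DOG_BREED)]|[TAGENTRY]
    }
}
```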
Managing tags with JavaScript
There are event handling functions that are called during and immediately after the application of tagging rules: onTaggerLevel() is called after applying the tagging rules of a given level, while onTagger() is called after applying all the tagging rules.
Tag instances can be managed by script with the appropriate methods of the pre-defined DIS object.