Conditional Random Fields (CRF)

Description

Conditional Random Fields (CRF) are probabilistic, discriminative models designed for sequence labeling: they assign a tag to each token of a tokenized text.

CRF models jointly exploit the conditional probability distribution over label transitions and observable features. As text can be represented as a sequence of symbols, the underlying graphical model used by the CRF is a simple linear chain.
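As a reference point, the standard textbook form of a linear-chain CRF can be written as below. The notation is an assumption introduced here for illustration and does not come from this model's implementation: x is the token sequence of length T, y the label sequence, f_k the feature functions over adjacent labels and observations, lambda_k their learned weights, and Z(x) the normalization term over all candidate label sequences y'.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)
```

Each feature function can look at the previous label, the current label, and the observed tokens, which is how label transitions and observable features are scored jointly.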

For entity extraction, the sequence labeling task is currently modeled with the BIO (BEGIN, INNER, OTHER) format, which can be summarized as follows:

Anything that should be excluded from labeling is marked as OTHER, while each entity to be extracted has a BEGIN_CLASS label on its first token and an INNER_CLASS label on all of its remaining tokens, until another BEGIN_CLASS or an OTHER label is found.
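The labeling scheme can be illustrated with a minimal Python sketch. The function name spans_to_bio, the span format (start, end, class) with an exclusive end index, and the PERSON and PLACE classes are hypothetical choices made for this example only.

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans into one BIO label per token.

    spans: list of (start, end, entity_class) token indices, end exclusive.
    """
    # Every token starts as OTHER (nothing to extract).
    labels = ["OTHER"] * len(tokens)
    for start, end, entity_class in spans:
        # First token of the entity gets BEGIN_<CLASS>.
        labels[start] = f"BEGIN_{entity_class}"
        # All remaining tokens of the entity get INNER_<CLASS>.
        for i in range(start + 1, end):
            labels[i] = f"INNER_{entity_class}"
    return labels


tokens = ["Ada", "Lovelace", "was", "born", "in", "London", "."]
spans = [(0, 2, "PERSON"), (5, 6, "PLACE")]  # hypothetical annotations
labels = spans_to_bio(tokens, spans)

for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

The two-token entity receives BEGIN_PERSON followed by INNER_PERSON, the single-token entity receives only BEGIN_PLACE, and every other token stays OTHER.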

Properties

CRF is a very powerful algorithm when working with plain categorical/discrete features (for example symbols such as word forms or part-of-speech tags), but it is not well suited to dense-vector data representations (for example word embeddings).
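To make "categorical/discrete features" concrete, here is a minimal sketch of a symbolic feature extractor for one token. The feature names, the token_features helper, and the window parameter are assumptions made for illustration; they are not the feature set used by this model type.

```python
def token_features(tokens, pos_tags, i, window=1):
    """Build symbolic features for token i and its neighbours within `window`."""
    features = {
        "word.lower": tokens[i].lower(),   # word form
        "pos": pos_tags[i],                # part-of-speech tag
        "is_title": tokens[i].istitle(),   # capitalization pattern
    }
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        if 0 <= j < len(tokens):
            # Context features are keyed by relative position, e.g. "+1:pos".
            features[f"{offset:+d}:word.lower"] = tokens[j].lower()
            features[f"{offset:+d}:pos"] = pos_tags[j]
        else:
            # Mark positions that fall outside the sentence.
            features[f"{offset:+d}:pad"] = True
    return features


tokens = ["Ada", "Lovelace", "was", "born"]
pos_tags = ["PROPN", "PROPN", "AUX", "VERB"]  # hypothetical tags
first = token_features(tokens, pos_tags, 0)
print(first)
```

Every feature is a discrete symbol or flag. Widening the window adds more keys per token, which is one reason large context windows increase the number of weights a CRF has to store.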

With moderately sized training sets, the model tends to be small, but it grows (affecting both disk and memory consumption) as more annotated data becomes available, and it can become unsustainable, especially when large context windows are used.

Without well-tuned regularization parameters (c1 and c2), CRF may over-memorize patterns in the data, leading to predictions with good precision but lower than expected recall. To handle this issue, it's possible to enable F-beta optimization with beta values greater than 1 (a beta greater than 1 gives more weight to recall than to precision, while a beta lower than 1 does the opposite).
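The effect of beta can be checked with the standard F-beta definition, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). The sketch below is a plain Python illustration with made-up precision and recall values; it is not the optimization routine used during training.

```python
def f_beta(precision, recall, beta):
    """Standard F-beta score: beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Two hypothetical models with mirrored (precision, recall) profiles.
high_precision = (0.9, 0.5)
high_recall = (0.5, 0.9)

for beta in (0.5, 1.0, 2.0):
    print(
        f"beta={beta}: "
        f"high-precision model={f_beta(*high_precision, beta):.3f}, "
        f"high-recall model={f_beta(*high_recall, beta):.3f}"
    )
```

With beta equal to 1 the two models score the same; with beta above 1 the high-recall model scores higher, and with beta below 1 the high-precision model does.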

With coherent annotation and rich symbolic features that produce strong regularities in data, CRF can perform very effectively.

In contrast, if annotations are inconsistent to some degree, training tends to struggle, generalization becomes harder to achieve, and the model starts "memorizing" only a small list of very strong regular patterns.

In the presence of very sparse annotations (for example long documents with very few passages where labels must be detected), CRF models may start to suffer from the over-representation of OTHER (any token where nothing should be predicted is considered OTHER), but it is also possible that other algorithms (for example transformers) may suffer from this issue even more.
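A quick way to see how over-represented OTHER is in a training set is to measure the share of each label. The sketch below assumes the annotations are already available as one list of BIO labels per document; the label_shares helper and the sample data are hypothetical.

```python
from collections import Counter


def label_shares(label_sequences):
    """Return the fraction of tokens carrying each label across all documents."""
    counts = Counter(label for seq in label_sequences for label in seq)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Two hypothetical documents: 100 tokens in total, only 2 inside an entity.
documents = [
    ["OTHER"] * 48 + ["BEGIN_DATE", "INNER_DATE"],
    ["OTHER"] * 50,
]
shares = label_shares(documents)

for label, share in sorted(shares.items()):
    print(f"{label}: {share:.2%}")
```

A share of OTHER very close to 100% is a sign that the annotations are sparse in the sense described above.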

Hyperparameters

The hyperparameters for this model type are:

CRF c1 regularization coefficient
CRF c2 regularization coefficient
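
This page does not define the two coefficients. In CRFsuite-style implementations, c1 is the coefficient of the L1 penalty and c2 the coefficient of the L2 penalty; assuming the same convention applies here, training can be sketched as minimizing the regularized negative log-likelihood below. The symbols are assumptions introduced for illustration: lambda is the vector of feature weights, N the number of training sequences, and the exact scaling of each term may differ between implementations.

```latex
\hat{\lambda} = \arg\min_{\lambda} \Big( -\sum_{i=1}^{N} \log p\big(y^{(i)} \mid x^{(i)}; \lambda\big) + c_1 \lVert \lambda \rVert_1 + c_2 \lVert \lambda \rVert_2^2 \Big)
```

Under this reading, a larger c1 pushes more weights to exactly zero (a sparser, smaller model), while a larger c2 shrinks all weights toward zero without eliminating them.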