Segments
Document segmentation is one of the two techniques a developer can use to create custom textual subdivisions in an input document (the other being sectioning). This technique is particularly useful when the original layout or structure of the text is essential for correctly identifying or retrieving information.
Segments are dynamic text blocks which are identified during the processing of input texts by means of custom linguistic rules.
For example, consider a plain-text input document such as the following:
Ingredients
1 teaspoon olive oil
1 cup diced zucchini
1/2 cup minced onion
1 clove garlic, peeled and minced
2 cups diced fresh tomatoes
2 tablespoons chopped fresh basil
1/4 teaspoon salt
1/4 teaspoon ground black pepper
4 (6 ounce) halibut steaks
1/3 cup crumbled feta cheese
PREP 15 mins
COOK 15 mins
READY IN 30 mins
Directions
1.preheat oven to 450 degrees F (230 degrees C). Lightly grease a shallow baking dish.
2.heat olive oil in a medium saucepan over medium heat and stir in zucchini, onion, and garlic. Cook and stir 5 minutes or until tender. Remove saucepan from heat and mix in tomatoes, basil, salt, and pepper.
3.arrange halibut steaks in a single layer in the prepared baking dish. Spoon equal amounts of the zucchini mixture over each steak. Top with feta cheese.
4.bake 15 minutes in the preheated oven, or until fish is easily flaked with a fork.
We can agree that such a document is recognizable at first glance as a recipe. It’s easy for a human being to browse the text content and take a quick glimpse at the layout to understand where the ingredients are listed and where to look for cooking directions. However, this document is not electronically marked up, so it isn't as easy for a text processing tool to recognize these same elements.
Text segmentation allows the user to reconstruct this mark-up, so that the visual structure can be “translated” to a machine-readable format and used for automatic text processing.
The rules used to identify dynamic segments are similar to categorization rules. However, unlike categorization rules, when a segmentation rule identifies a concept in a text, it will neither provide a score to a category, nor link a document to any category. Segmentation rules are created to identify the boundaries of a relevant text block by searching for specific terms and phrases and using them as reference points. The concepts identified by segmentation rules act as “Cartesian coordinates” which help the developer and the software to outline the segments contained in the entire text.
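To make the idea of reference points concrete, here is a minimal Python sketch. It is not written in the rule language of the product and does not claim to reflect how the engine works internally. The marker terms, the segment names (INGREDIENTS, DIRECTIONS) and the function find_segments are all assumptions chosen for illustration. The sketch treats any line consisting solely of a marker term as a boundary and assigns each following line to that segment until the next marker is found.

```python
from typing import Dict, List, Optional

# Hypothetical marker terms mapped to hypothetical segment names.
MARKERS: Dict[str, str] = {
    "ingredients": "INGREDIENTS",
    "directions": "DIRECTIONS",
}


def find_segments(text: str) -> Dict[str, List[str]]:
    """Group each non-empty line under the most recent marker line seen."""
    segments: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        name = MARKERS.get(stripped.lower())
        if name is not None:
            # A marker line opens a new segment and is not stored itself.
            current = name
            segments.setdefault(current, [])
        elif current is not None:
            segments[current].append(stripped)
    return segments


recipe = """Ingredients
1 teaspoon olive oil
1 cup diced zucchini
Directions
1.preheat oven to 450 degrees F (230 degrees C).
"""

segments = find_segments(recipe)
```

Note that the marker lines produce no score and no category: they only tell the sketch where one block ends and the next begins, which mirrors the role described above for segmentation rules.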
Segments are identified by meaningful names, usually chosen by the developer, so that they can easily be referenced throughout a project. Before any segmentation rules are developed, every segment must be declared in a .cr file. The default file config.cr already contains a declaration:
SEGMENTS
{
    @SEGMENT1(1.0),
    @SEGMENT2(1.0)
}
Note
The parameter inside the round brackets is reserved for future use.
Once the segment names have been declared, it is possible to develop as many segmentation rules as needed.
The final goal of document segmentation is to be able to use the dynamic segments as scope options for both categorization and extraction rules. In other words, a rule can be set so that it is applied only to one or more text blocks which were tagged as segments. This acts as a preliminary selection on the whole text, isolating the portion that is most likely to contain the relevant information to be categorized or extracted.
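The effect of using a segment as a scope can be illustrated with a short Python sketch. Again, this is not the rule syntax of the product: the segments dictionary, the INGREDIENT pattern and the function extract_in_scope are hypothetical names, and the regular expression is only a toy stand-in for an extraction rule. The point is that the pattern is evaluated only against lines belonging to the chosen segment.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical segments, as a prior segmentation step might have produced them.
segments: Dict[str, List[str]] = {
    "INGREDIENTS": ["1 teaspoon olive oil", "1/2 cup minced onion"],
    "DIRECTIONS": ["2.heat olive oil in a medium saucepan over medium heat."],
}

# Toy "extraction rule": a quantity, a unit, then the ingredient name.
INGREDIENT = re.compile(
    r"^(\d+(?:/\d+)?)\s+(teaspoons?|tablespoons?|cups?)\s+(.+)$"
)


def extract_in_scope(
    segs: Dict[str, List[str]], scope: str
) -> List[Tuple[str, ...]]:
    """Apply the pattern only to lines of the named segment."""
    hits: List[Tuple[str, ...]] = []
    for line in segs.get(scope, []):
        match = INGREDIENT.match(line)
        if match:
            hits.append(match.groups())
    return hits


found = extract_in_scope(segments, "INGREDIENTS")
```

In this sample, "olive oil" occurs in both segments, but only the line inside INGREDIENTS is ever examined, so the mention in the cooking directions cannot produce a spurious ingredient.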