Segmentation is one of three ways (the others being sections and document layout) to define custom textual subdivisions in an input document, in addition to the default subdivisions that the disambiguator automatically detects.
Segmentation is useful when the organization of the text inside the document is fundamental for the correct identification and retrieval of information and when that organization is not known "a priori", that is, it is not provided as additional information together with the input text.
In fact, while sections and document layout are defined as metadata inside the input document, so they "come from outside" the model, segments are blocks of text that the model itself identifies dynamically, using specific rules, in an early phase of the document processing pipeline.
For example, consider a plain-text input document such as the following:
Ingredients
1 teaspoon olive oil
1 cup diced zucchini
1/2 cup minced onion
1 clove garlic, peeled and minced
2 cups diced fresh tomatoes
2 tablespoons chopped fresh basil
1/4 teaspoon salt
1/4 teaspoon ground black pepper
4 (6 ounce) halibut steaks
1/3 cup crumbled feta cheese
Directions
PREP 15 mins
COOK 15 mins
READY IN 30 mins
1.preheat oven to 450 degrees F (230 degrees C). Lightly grease a shallow baking dish.
2.heat olive oil in a medium saucepan over medium heat and stir in zucchini, onion, and garlic. Cook and stir 5 minutes or until tender. Remove saucepan from heat and mix in tomatoes, basil, salt, and pepper.
3.arrange halibut steaks in a single layer in the prepared baking dish. Spoon equal amounts of the zucchini mixture over each steak. Top with feta cheese.
4.bake 15 minutes in the preheated oven, or until fish is easily flaked with a fork.
If we want to create an efficient model that extracts ingredients, it is better to limit the extraction to the initial part of the document where the ingredients are listed, that is, between Ingredients and Directions. With segmentation rules it's easy to create, for every input recipe, an "ingredients" segment using those two words as boundaries, then use the SEGMENT scope in extraction rules to limit the extraction to the segment text. Segments can be used for categorization as well.
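The idea of a boundary-delimited segment can be illustrated with a minimal sketch in plain Python. This is not the segmentation rules language and not how the product implements segments; it only mimics the concept. The function names, the keyword list and the shortened sample text are hypothetical, and the sketch assumes that the recipe contains the heading lines Ingredients and Directions.

```python
# Conceptual sketch only: plain Python, not the segmentation rules language.
# All names and the shortened sample text are hypothetical.

RECIPE = """Ingredients
1 teaspoon olive oil
1 cup diced zucchini
1/3 cup crumbled feta cheese
Directions
PREP 15 mins
1.heat olive oil in a medium saucepan over medium heat and stir in zucchini.
"""

# Toy vocabulary standing in for real extraction rules.
KNOWN_INGREDIENTS = ("olive oil", "zucchini", "feta cheese")


def find_segment(text, begin_word, end_word):
    """Return the non-empty lines strictly between the two boundary lines.

    Returns an empty list if either boundary line is missing.
    """
    lines = text.splitlines()
    normalized = [line.strip().lower() for line in lines]
    try:
        start = normalized.index(begin_word.lower())
        end = normalized.index(end_word.lower(), start + 1)
    except ValueError:
        return []
    return [line for line in lines[start + 1:end] if line.strip()]


def extract_ingredients(lines):
    """Collect every known ingredient term found in the given lines."""
    hits = []
    for line in lines:
        lowered = line.lower()
        for term in KNOWN_INGREDIENTS:
            if term in lowered:
                hits.append(term)
    return hits


if __name__ == "__main__":
    # Extraction over the whole document also picks up the directions.
    print(extract_ingredients(RECIPE.splitlines()))
    # Extraction limited to the "ingredients" segment.
    print(extract_ingredients(find_segment(RECIPE, "Ingredients", "Directions")))
```

Run over the whole text, the toy extractor also matches "olive oil" and "zucchini" in the directions, whereas restricting it to the segment returns each listed ingredient once. That is the effect the SEGMENT scope is meant to achieve in extraction rules.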
The next article describes the syntax of segmentation rules.