Scripting overview

The scripting language

As mentioned in the introduction to Studio languages, scripting is a completely optional but very powerful way to customize the document analysis pipeline beyond the possibilities offered by the categorization and extraction rule languages.
The project script, split into event handling functions, is executed during the various phases of document analysis.

Studio's scripting language conforms to the ECMA-262 5.1 language specification (ECMAScript 5.1), so it can be described as "JavaScript-like".
JavaScript is widespread: it is typically used inside Web browsers to make Web pages dynamic, but it is also used outside them, for example in Node.js.
There is a lot of educational material on the Web for learning JavaScript.
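
As a rough illustration of what ECMAScript 5.1 conformance means in practice, the sketch below uses only constructs defined in that edition: var declarations, function expressions, the Array.prototype iteration methods and the JSON object. The variable names are made up for the example. Syntax from later editions (let, const, arrow functions, template literals, classes) is not part of 5.1, so the sketch avoids it; whether Studio accepts any of it is not stated here.

```javascript
// Plain ECMAScript 5.1: var, function expressions, array methods, JSON.
var words = ["alpha", "beta", "gamma"];

// Array.prototype.map builds a new array from each element.
var lengths = words.map(function (word) {
    return word.length;
});

// Array.prototype.filter keeps only the elements that pass the test.
var longWords = words.filter(function (word) {
    return word.length > 4;
});

// JSON.stringify serializes a plain object to a string.
var summary = JSON.stringify({ lengths: lengths, longWords: longWords });
```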

The main.jr file

Text intelligence engines created with Studio execute the script defined in the main.jr file.

By default, when you create a project with Studio, the main.jr file defines only the initialize and shutdown functions and contains the commented-out prototypes of the other functions.

In this state, the script doesn't affect the engine's results, which are thus solely determined by rules, but if you uncomment one or more functions and put specific code inside them, you can control and extend the document analysis pipeline.
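
A minimal sketch of a main.jr file in which one handler has been uncommented might look like the following. The function names initialize, shutdown and onPrepare come from this page; the parameter names, the return values and the idea that onPrepare receives and returns the document text are assumptions made for illustration, so check the prototypes in your own main.jr for the actual signatures.

```javascript
// Hypothetical sketch of main.jr. Parameter names and return
// conventions are assumptions, not the documented signatures.

// Executed once when the engine starts.
function initialize(cmdLineParams) {
    return true; // assumed convention: report a successful start-up
}

// Executed once, immediately before the engine is stopped.
function shutdown() {
}

// Uncommented handler. Assumed to receive the input text and to return
// the text that the rest of the pipeline will analyze.
function onPrepare(text) {
    // Collapse each run of spaces and tabs into a single space.
    return text.replace(/[ \t]+/g, " ");
}
```

With only initialize and shutdown defined, the script leaves the results to the rules; adding a handler like the one above is where the script starts to influence the pipeline.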

Event handlers

All of the predefined or commented-out functions in main.jr are event handlers, namely portions of code automatically executed before or after a specific processing event.
The initialize function is executed when the engine starts, while the shutdown function is executed immediately before the engine is stopped. The other functions are called at specific moments of the document analysis pipeline.
The phases of the pipeline and the events that are fired after those phases are shown in the following figure. Events are listed inside the dashed area.

The handling functions corresponding to events are listed in the following table.

Event        Event handling function
Prepare      onPrepare
Tagger       onTagger, onTaggerLevel
Categorizer  onCategorizer
Finalize     onFinalize
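
The mapping in the table can be pictured with a small stand-alone simulation. This is not Studio's engine code: fireEvent, the script object and the payload are hypothetical names, and the sketch assumes that events fire in the order in which the table lists them. It shows only the general idea that a handler runs when its event fires if the script defines it, and is skipped when it is left commented out.

```javascript
// Event-to-handler mapping, taken from the table.
var EVENT_HANDLERS = {
    Prepare: ["onPrepare"],
    Tagger: ["onTagger", "onTaggerLevel"],
    Categorizer: ["onCategorizer"],
    Finalize: ["onFinalize"]
};

// Hypothetical dispatcher: calls every handler the script defines for
// an event and skips the ones that are missing (left commented out).
function fireEvent(script, eventName, payload) {
    var names = EVENT_HANDLERS[eventName] || [];
    var called = [];
    for (var i = 0; i < names.length; i++) {
        var handler = script[names[i]];
        if (typeof handler === "function") {
            handler(payload);
            called.push(names[i]);
        }
    }
    return called;
}

// A script that defines only two of the five handlers.
var script = {
    onTagger: function (payload) { payload.log.push("tagger"); },
    onFinalize: function (payload) { payload.log.push("finalize"); }
};

// Fire the events in table order and record which handlers ran.
var payload = { log: [] };
var order = ["Prepare", "Tagger", "Categorizer", "Finalize"];
var fired = [];
for (var j = 0; j < order.length; j++) {
    fired = fired.concat(fireEvent(script, order[j], payload));
}
```

Here only onTagger and onFinalize end up in fired, because the other handlers are not defined in the simulated script.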

The following articles in this section describe what you can do within each of these event handling functions, while specific articles are devoted to:

  • Predefined functions
  • Predefined objects:
    • The CTX object returns information about the categorization and extraction processes.
    • The DIS object gives access to the results of the disambiguation phase.
    • The LAY object gives access to the layout of PDF documents.
    • The REX object allows for regular expression-based find & replace operations.
    • The UTL object provides helper utilities.
    • The XML object is used to navigate a Studio project taxonomy file.
  • Predefined and custom modules.

Debugging

The script can be debugged with the Studio built-in debugger.