Scripting overview
The scripting language
As mentioned in the introduction to Studio languages, scripting is a completely optional but very powerful way to customize the document analysis pipeline beyond the possibilities offered by the categorization and extraction rule languages.
The project script, split into event handling functions, is executed during the various phases of document analysis.
Studio's scripting language conforms to the ECMA-262 5.1 language specification (ECMAScript 5.1) and can therefore be described as "JavaScript-like".
JavaScript is widespread: it is typically used inside web browsers to make web pages dynamic, but it also runs outside the browser, for example in Node.js.
There is a lot of educational material on the web for learning JavaScript.
The main.jr file
Text intelligence engines created with Studio execute the script defined in the main.jr file.
By default, when you create a project with Studio, the main.jr file only defines the initialize and the shutdown functions and contains the commented out prototypes of other functions.
In this state, the script doesn't affect the engine's results, which are thus solely determined by rules, but if you uncomment one or more functions and put specific code inside them, you can control and extend the document analysis pipeline.
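As a rough illustration, a freshly created main.jr might look like the sketch below. Only the function names are taken from this documentation; the parameters (cmdLine, text, level, domains, result) and the return values are assumptions made for the example and should be checked against the file that Studio actually generates.

```javascript
// Hypothetical sketch of a freshly created main.jr.
// Function names come from the documentation; parameters and
// return values are assumptions made for illustration only.

// Executed when the engine starts.
function initialize(cmdLine) {
    return true;
}

// Executed immediately before the engine is stopped.
function shutdown() {
}

// Commented out prototypes: uncomment one and add code to it
// to hook into the document analysis pipeline.
// function onPrepare(text) { return text; }
// function onTagger() {}
// function onTaggerLevel(level) {}
// function onCategorizer(domains) { return domains; }
// function onFinalize(result) { return result; }
```

As long as the handlers other than initialize and shutdown stay commented out, a script like this leaves the results produced by the rules unchanged.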
Event handlers
All of the predefined or commented out functions in main.jr are event handlers, namely portions of code automatically executed before or after a specific processing event.
The initialize function is executed when the engine starts, while the shutdown function is executed immediately before the engine is stopped. The other functions are called at specific moments of the document analysis pipeline.
The phases of the pipeline and the events that are fired after those phases are shown in the following figure. Events are listed inside the dashed area.
The handling functions corresponding to events are listed in the following table.
Event | Event handling function |
---|---|
Prepare | onPrepare |
Tagger | onTagger, onTaggerLevel |
Categorizer | onCategorizer |
Finalize | onFinalize |
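The mapping in the table can be pictured with a small simulation. The sketch below is not the engine's code: runPipeline and HANDLERS_BY_EVENT are hypothetical names, the handlers are called without arguments, and the relative order of onTagger and onTaggerLevel is simply assumed from the order in the table. It only shows that each event triggers the handlers the script defines and skips the ones that are missing.

```javascript
// Event names and handler names as listed in the table.
var HANDLERS_BY_EVENT = {
    Prepare: ["onPrepare"],
    Tagger: ["onTagger", "onTaggerLevel"],
    Categorizer: ["onCategorizer"],
    Finalize: ["onFinalize"]
};

// Hypothetical stand-in for the engine: fires the events in table order
// and calls every handler that the given script object defines.
// Returns the names of the handlers that were actually called.
function runPipeline(script) {
    var events = ["Prepare", "Tagger", "Categorizer", "Finalize"];
    var called = [];
    for (var i = 0; i < events.length; i++) {
        var names = HANDLERS_BY_EVENT[events[i]];
        for (var j = 0; j < names.length; j++) {
            var handler = script[names[j]];
            if (typeof handler === "function") {
                handler();
                called.push(names[j]);
            }
        }
    }
    return called;
}

// A script that defines only two handlers, as if the other
// prototypes were still commented out.
var script = {
    onPrepare: function () {},
    onFinalize: function () {}
};

var called = runPipeline(script);
console.log(called.join(", ")); // prints "onPrepare, onFinalize"
```

In the real engine each handler presumably receives data from the phase that has just completed; the simulation leaves that out on purpose, because the argument lists are not described here.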
The following articles in this section describe what you can do within each of these event handling functions, while specific articles are devoted to:
- Predefined functions
- Predefined objects:
    - The CTX object returns information about the categorization and extraction processes.
    - The DIS object gives access to the results of the disambiguation phase.
    - The LAY object gives access to the layout of PDF documents.
    - The REX object allows for regular expression-based find & replace operations.
    - The UTL object provides helper utilities.
    - The XML object is used to navigate a Studio project taxonomy file.
- Predefined and custom modules
Debugging
The script can be debugged with the Studio built-in debugger.