Extract Converter processor

Description

The Extract Converter processor takes PDF documents as input and, page by page, extracts the text blocks together with their graphic layout. For each block, it determines the position on the page, the size and the type, which is one of heading, body part, header, footer or table cell.
The processor also obtains further information such as the number of pages, author and creation date, the table of contents, the list of fonts used in the document and any PDF metadata.

When extracting text, Extract Converter can use Optical Character Recognition (OCR) to get plain text from images that represent text.

It is used as the first block of a workflow when it is certain that only PDF documents will be analyzed, as an alternative to the TikaTesseract Converter processor. Compared with TikaTesseract Converter, Extract Converter returns the text in an order closer to the one a human reader would follow, especially for documents with complex layouts (multiple columns, text boxes, figures). This makes the text better suited to effective analysis with a model.

Old version

Two versions of the component are available: the latest and 1.0.0. Version 1.0.0 is kept for backward compatibility with workflows created with previous versions of NL Flow. For new workflows, always use the latest version.

The blocks of version 1.0.0 are identical to those of version 1.9 of NL Flow, except for these changes in the block properties:

The name of the block is modified by selecting Edit component name instead of editing the Block name field.
The Output tab has been added (see below).

All other block properties remain unchanged, so for a version 1.0.0 block refer to the NL Flow version 1.9 documentation and ignore the rest of this page, except for the description of the Output tab.

Input

The input to an Extract Converter block must be a JSON like this:

{
  "base64": (string) Base64 encoding of a PDF file,
  "path": (string) Filename or path of the PDF file
}

path is used only for debugging, logging or auditing purposes. It is not used to "read" the file, whose content is entirely represented by the value of the base64 key.
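As an illustration, the input JSON could be assembled in Python as in the minimal sketch below. The function name build_input and the sample file name are hypothetical, and the way the JSON is submitted to the workflow is outside the scope of the example.

```python
import base64
import json


def build_input(pdf_bytes: bytes, path: str) -> dict:
    """Return the input JSON of an Extract Converter block as a dict."""
    return {
        # The whole PDF file, Base64-encoded as an ASCII string.
        "base64": base64.b64encode(pdf_bytes).decode("ascii"),
        # Informational only: the block never opens this path.
        "path": path,
    }


if __name__ == "__main__":
    # Placeholder bytes; a real caller would read them from a PDF file
    # opened in binary mode.
    payload = build_input(b"%PDF-1.7 placeholder", "policy.pdf")
    print(json.dumps(payload, indent=2))
```

Taking the file content as bytes keeps the helper independent of where the PDF comes from (disk, upload, object storage).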

Block properties

Block properties can be set by editing the block.
Extract Converter workflow blocks have the following properties:

Basic properties:
- Block name, it can be edited
- Component version (read only)
- Block ID (read only)

Functional:

Enable or disable table and title detection: toggles the recognition of tables and headings (enabled by default).
Enable or disable OCR extraction: toggles Optical Character Recognition (OCR) to extract text from images (disabled by default).

Specify OCR language: when OCR is enabled, the language or script to be used to determine text. Possible choices are:

Value	Description
`eng`	English
`chi_sim`	Chinese (simplified)
`chi_tra`	Chinese (traditional)
`hin`	Hindi
`spa`	Spanish
`fra`	French
`ara`	Arabic
`ben`	Bengali
`rus`	Russian
`por`	Portuguese
`ind`	Indonesian
`deu`	German
`ita`	Italian
`latn`	Latin script (any language with Latin characters)

For multi-language documents, concatenate language codes with a plus character (+), for example: eng+spa+ita.
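The concatenation rule can be sketched as a small Python helper. The helper name and the validation step are assumptions made for illustration: the set of codes is copied from the table above, and the block itself may accept or reject values differently.

```python
# Codes copied from the table of possible OCR language choices.
SUPPORTED_OCR_CODES = {
    "eng", "chi_sim", "chi_tra", "hin", "spa", "fra", "ara",
    "ben", "rus", "por", "ind", "deu", "ita", "latn",
}


def ocr_language_value(codes: list[str]) -> str:
    """Join language codes with '+' after checking them against the table."""
    if not codes:
        raise ValueError("at least one language code is required")
    unknown = [code for code in codes if code not in SUPPORTED_OCR_CODES]
    if unknown:
        raise ValueError(f"unsupported OCR language codes: {unknown}")
    return "+".join(codes)


if __name__ == "__main__":
    print(ocr_language_value(["eng", "spa", "ita"]))  # eng+spa+ita
```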

Enable/disable table of contents extraction: toggles the extraction of the table of contents, if any (enabled by default).
Enable/disable font extraction: toggles the extraction of the list of fonts used in the document (enabled by default).
Reading order mode: text extraction algorithm. Possible values are:
- standard: the algorithm tries to go from one text block to the next the way a human would do when reading the page. Best for multi-column or mixed layout pages.
- vertical: the algorithm considers the page as single-column and reads text blocks from left to right, from top to bottom.
- auto: the algorithm classifies each page based on its layout then automatically chooses between standard and vertical to extract text.

Deployment:
- Timeout: execution timeout expressed in minutes (m) or seconds (s).
- Replicas: number of required instances.
- Memory: required memory.
- CPU: thousandths of a CPU required (for example: 1000 = 1 CPU).
- Consumer Number: number of threads of the consumer, the software module of the block that provides input to process by taking it from the block's work queue.
Input: these properties correspond to the top level keys of the expected input JSON.
They are read-only—so only descriptive of the expected input—when the block is the first in a data flow and the workflow's input has not been explicitly described. In that case the workflow's input JSON must contain the expected keys, with the same name and type.
They are editable otherwise, and must be set through fully automatic, assisted or manual input mapping.
Output: read-only, this property is a navigable description of the structure of the output JSON.

Output

Structure

The output of an Extract Converter block is a JSON object with the following structure:

{
    "result": {
        "annotations": [],
        "fonts": [],
        "header": {},
        "layout": [],
        "tableOfContents": [],
        "timeInfo": [],
        "words": []
    }
}

where:

annotations is reserved for future use.
fonts (array): it contains a list of the fonts used in the document.
header (object): it contains general information about the document and the text extraction task.
layout (array): it contains the document's layout, that is the text organized in a hierarchical structure with pages and corresponding text blocks.
tableOfContents (array): it contains the document table of contents (TOC), if any.
timeInfo (object): it contains troubleshooting information.
words (array): it contains the document's text in words, without layout information.

fonts

This array lists the fonts used in the document's text. Each item in the array represents a font and has these properties:

bold (boolean): true if bold, false otherwise
id (integer): unique font ID assigned during the analysis
id_name (string): original font name
italic (boolean): true if italic, false otherwise
name (string): normalized font name
ocr (boolean): true if recognized through OCR, false otherwise
pdf_name (string): PDF font name
strikethrough (boolean): true if struck through, false otherwise
underline (boolean): true if underlined, false otherwise

Note

The pdf_name property may be absent from the results of an analysis.

For example:

{
    "bold": false,
    "id": 1,
    "id_name": "Arial",
    "italic": false,
    "name": "Arial",
    "ocr": false
}

Note

If a font is not detected the key name is set to mix.

header

The header object contains information about the whole document and the extraction task. The properties are:

conversionDateTime (string): extraction task end date and time.
customInfo (object): PDF document properties:
- Author: author
- CreationDate: creation date and time¹
- Creator: creator
- ModDate: last modification date and time¹
- Producer: generator application
- Title: document title
documentName (string): document name.
errorPages (integer): number of pages that could not be analyzed (present only in case of errors).
options (object): extraction task options, for troubleshooting only.
totPages (integer): total number of pages.
version (string): software version for the Extract Converter processor.

metadata (array): PDF metadata.
Metadata is optional data that the PDF editor can insert into pages. This data is not displayed on the page, but is associated with visible elements.

Each metadata can have these properties:

bbox (array): it contains the coordinates² of the metadata bounding box.
- item 0: upper left corner X
- item 1: upper left corner Y
- item 2: lower right corner X
- item 3: lower right corner Y
key (string): metadata key, its name.
page (integer): number of the page where the metadata is located.
value (string): metadata value.

For example:

"metadata": [{
        "bbox": [146, 207, 419, 228],
        "key": "txtPolicyNumber",
        "page": 3,
        "value": "PACUIC001101-07 "
    }, {
        "bbox": [39, 426, 417, 357],
        "key": "txtNamedInsuredAndAddress",
        "page": 3,
        "value": "SWEET FRUIT ASSOCIATION INC.\r\n7100 APRICOT WAY\r\nST. PETERSBURG, FL  33706 "
   }, {
        "bbox": [421, 356, 829, 438],
        "key": "AgencyNameAndAddress",
        "page": 3,
        "value": "StaySafe Insurance Services, Inc.\r\n2502 N Rodeo Drive\r\nTampa, FL  33607 "
    }, {
        "bbox": [144, 254, 283, 275],
        "key": "txtEffectiveDate",
        "page": 3,
        "value": "4/27/2022 "
    }
]
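Once the metadata array has been read from the block output, the values are often easier to use when indexed by key. The sketch below is one way to do it in Python. The function name is hypothetical, stripping the trailing whitespace seen in the example values is an assumption about what a consumer wants, and if the same key occurs more than once the last occurrence wins.

```python
def metadata_by_key(metadata: list[dict]) -> dict[str, str]:
    """Index metadata values by key, trimming surrounding whitespace."""
    return {item["key"]: item["value"].strip() for item in metadata}


if __name__ == "__main__":
    sample = [
        {"bbox": [146, 207, 419, 228], "key": "txtPolicyNumber",
         "page": 3, "value": "PACUIC001101-07 "},
        {"bbox": [144, 254, 283, 275], "key": "txtEffectiveDate",
         "page": 3, "value": "4/27/2022 "},
    ]
    fields = metadata_by_key(sample)
    print(fields["txtPolicyNumber"])
    print(fields["txtEffectiveDate"])
```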

layout

layout is an array containing all the layout elements recognized in the document.
The order of the elements inside the array reflects the sequence of pages, so all the elements of page 1 are found first, then those of page 2, and so on.
Within the elements of a page, the first element represents the page itself and the other elements are blocks of text, tables or table cells. The position of text blocks and tables in the array corresponds to what Extract Converter assumed to be the order in which a human would read them on the page.
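Because the array order already reflects pages and the assumed reading order, the text of each page can be collected with a single pass over layout. The following Python sketch shows the idea. The function name, the choice of keeping only title and text elements, and the sample data are assumptions made for illustration.

```python
from collections import defaultdict


def text_by_page(layout: list[dict],
                 types: tuple[str, ...] = ("title", "text")) -> dict[int, list[str]]:
    """Collect element contents per page, preserving the array order."""
    pages = defaultdict(list)
    for element in layout:
        # Page and table elements are containers and carry no content.
        if element.get("type") in types and "content" in element:
            pages[element["page"]].append(element["content"])
    return dict(pages)


if __name__ == "__main__":
    # Hypothetical, heavily reduced layout array.
    sample_layout = [
        {"id": 1, "type": "page", "page": 1, "children": [2, 3]},
        {"id": 2, "type": "title", "page": 1, "parent": 1, "content": "Schedule"},
        {"id": 3, "type": "text", "page": 1, "parent": 1, "content": "First paragraph."},
        {"id": 4, "type": "page", "page": 2, "children": [5]},
        {"id": 5, "type": "footer", "page": 2, "parent": 4, "content": "Page 2"},
    ]
    print(text_by_page(sample_layout))
```

Passing a different types tuple, for example one that includes header and footer, changes which elements are kept.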

Elements can represent:

Pages (only the bounding box)
Titles
Headers
Footers
Body-level text blocks
Tables (only the bounding box)
Table cells
TOC items

The properties that each element can have are:

Element type→ Property ↓	Pages	Titles	Headers & footers	Body-level text blocks	Tables	Table cells	TOC items
`id`	X	X	X	X	X	X	X
`type`	X	X	X	X	X	X	X
`page`	X	X	X	X	X	X	X
`children`	X				X
`parent`		X	X	X	X	X	X
`content`		X	X	X		X	X
`bbox`	X	X	X	X	X	X	X
`label`		X
`row`						X
`column`						X
`isHead`						X
`span`						X
`relativePage`	X

Properties are:

id (integer): element ID, every element has a unique value for this property.

type (string): element type. Possible values are:

Element type	`type` value	Description
Pages	`page`	The "container" (it has no text of its own) of all the textual elements displayed on a page.
Body-level text blocks	`text`	A block of text (e.g. a paragraph, a text box) at the body-level, i.e. not a title.
Titles	`title`	A heading.
Tables	`table`	The "container" (it has no text of its own) of all the elements (cells) of a table.
Table cells	`cell`	A table's cell.
Headers	`header`	A page header.
Footers	`footer`	A page footer.
TOC items	`toc`	An item of the table of contents.

page (integer): page number
children (array): list of child blocks' IDs, only in page and table elements. Each item of the array is the value of the id property of an element that is hierarchically a child of this element. For example, the titles in a page are children of the page element, the cells of a table are children of a table element.
parent (integer): parent element ID. In case of table cells, the value of this property is the value of the id property of the table element, while for title, text, header & footer and table elements, it is the value of the id property of a page element. Page elements don't have this property because their "parent" is the document itself.
label (string): for titles, it specifies the title level with a label with this structure:

H#

where:
- H stands for Header.
- # is an integer representing the title level.
# has the same value as the level property of the corresponding item in tableOfContents.
content (string): text of the element, this property is absent in page and table elements, which are "containers".
bbox (array): it contains the coordinates² of the element's bounding box.
- Item 0: upper left corner X
- Item 1: upper left corner Y
- Item 2: lower right corner X
- Item 3: lower right corner Y
row (integer): cell row number.
column (integer): cell column number.
isHead (boolean): set to true if the cell is a column header.
span (array): cell span expressed in integer numbers. When present, the cell spans more than one row and/or column. The first item of the array is the row span, the second is the column span.
relativePage (string): the page label printed on the page. For example, page 4 could be labelled IV.
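The parent, row and column properties are enough to rebuild the grid of a table from its cell elements. The Python sketch below is a simplified illustration under stated assumptions: the function name and sample data are hypothetical, spans are ignored, and no assumption is made on whether row and column numbers start from 0 or 1 because the values are only sorted.

```python
def table_rows(layout: list[dict], table_id: int) -> list[list[str]]:
    """Return the cell texts of one table as a list of rows."""
    rows: dict[int, dict[int, str]] = {}
    for element in layout:
        # Keep only the cells whose parent is the requested table element.
        if element.get("type") == "cell" and element.get("parent") == table_id:
            row = rows.setdefault(element["row"], {})
            row[element["column"]] = element.get("content", "")
    return [[rows[r][c] for c in sorted(rows[r])] for r in sorted(rows)]


if __name__ == "__main__":
    # Hypothetical, heavily reduced layout array with one 2x2 table.
    sample_layout = [
        {"id": 10, "type": "table", "page": 1, "parent": 1, "children": [11, 12, 13, 14]},
        {"id": 12, "type": "cell", "parent": 10, "row": 1, "column": 2, "content": "Limit", "isHead": True},
        {"id": 11, "type": "cell", "parent": 10, "row": 1, "column": 1, "content": "Coverage", "isHead": True},
        {"id": 13, "type": "cell", "parent": 10, "row": 2, "column": 1, "content": "Umbrella"},
        {"id": 14, "type": "cell", "parent": 10, "row": 2, "column": 2, "content": "1,000,000"},
    ]
    for row in table_rows(sample_layout, 10):
        print(row)
```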

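tableOfContents

Each item of the tableOfContents array represents an item of the table of contents and has these properties:

score (number): item recognition confidence score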
level (integer): title level on the titles' hierarchy. The value of this property coincides with the integer associated to the label property of the corresponding title element in layout.
source (string): only for troubleshooting
layoutId (integer): cross-reference to the layout element. The value of this property coincides with the value of the id property of the corresponding title element in layout.
content (string): TOC item text

For example:

{
    "score": 0.8755,
    "level": 1,
    "source": "d",
    "layoutId": 2,
    "content": "UMBRELLA LIABILITY POLICY SCHEDULE"
}
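The level and layoutId properties make it possible to print an indented outline and to find the page of each item through the corresponding layout element. The Python sketch below illustrates this. The function name and sample data are hypothetical, and the indentation assumes that levels start at 1, as in the example above.

```python
def toc_outline(toc: list[dict], layout: list[dict], indent: str = "  ") -> list[str]:
    """Return one indented line per TOC item, with the page when it is known."""
    layout_by_id = {element["id"]: element for element in layout}
    lines = []
    for item in toc:
        prefix = indent * max(item["level"] - 1, 0)
        # layoutId points to the title element that carries the page number.
        element = layout_by_id.get(item["layoutId"], {})
        page = element.get("page")
        suffix = f" (page {page})" if page is not None else ""
        lines.append(f"{prefix}{item['content']}{suffix}")
    return lines


if __name__ == "__main__":
    sample_toc = [
        {"score": 0.8755, "level": 1, "source": "d", "layoutId": 2,
         "content": "UMBRELLA LIABILITY POLICY SCHEDULE"},
        {"score": 0.8, "level": 2, "source": "d", "layoutId": 7,
         "content": "Hypothetical subsection"},
    ]
    sample_layout = [
        {"id": 2, "type": "title", "page": 3, "label": "H1"},
        {"id": 7, "type": "title", "page": 4, "label": "H2"},
    ]
    print("\n".join(toc_outline(sample_toc, sample_layout)))
```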

words

The words array contains one item per page and each item represents, in an encoded and compressed form, all the words present on the page.

The value of the single item is encoded in Base64.
The decoded value is a byte array in gzip format. The expanded byte array value is another byte array in which each word corresponds to a variable-length sequence of bytes with this structure:

UTF-8 encoded text | 0x00 | Parent element ID | Font ID | Bounding box coordinates

UTF-8 encoded text is the text of the word.
Parent element ID is four bytes long and must be interpreted as a little-endian integer. The value is the ID—the value of the id property—of the layout element in which the word is located.
Font ID is four bytes long and must be interpreted as a little-endian integer. The value is the ID of the font with which the word is written in the document, so it coincides with the value of the id property of the item of the fonts array that represents the font.
Bounding box coordinates is 16 bytes long and consists of four parts of four bytes each. Each part must be interpreted as a little-endian integer. The parts are the coordinates² of the word bounding box and, taken from left to right, have this meaning:
1. upper left corner X
2. upper left corner Y
3. lower right corner X
4. lower right corner Y
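The decoding steps described above (Base64, then gzip, then record parsing) can be sketched in Python with the standard library only. The function name and the keys of the returned dictionaries are hypothetical. The description does not say whether the integers are signed, so reading them as signed 32-bit values is an assumption.

```python
import base64
import gzip
import struct

# Six little-endian 32-bit integers follow the 0x00 terminator:
# parent element ID, font ID and the four bounding box coordinates.
# Signed integers ("i") are an assumption.
_RECORD_TAIL = struct.Struct("<6i")


def decode_words_item(item: str) -> list[dict]:
    """Decode one item of the words array into a list of word records."""
    data = gzip.decompress(base64.b64decode(item))
    words = []
    pos = 0
    while pos < len(data):
        # The UTF-8 text runs up to the first 0x00 byte.
        end = data.index(b"\x00", pos)
        text = data[pos:end].decode("utf-8")
        parent_id, font_id, *bbox = _RECORD_TAIL.unpack_from(data, end + 1)
        words.append({"text": text, "parent": parent_id,
                      "font": font_id, "bbox": bbox})
        pos = end + 1 + _RECORD_TAIL.size
    return words


if __name__ == "__main__":
    # Build a synthetic item with one word to exercise the decoder.
    raw = b"Hello\x00" + _RECORD_TAIL.pack(7, 1, 10, 20, 50, 32)
    item = base64.b64encode(gzip.compress(raw)).decode("ascii")
    print(decode_words_item(item))
```

Since the array has one item per page, applying the function to each item in turn yields the words page by page.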

Output-input mapping

The result top level key in the output JSON is compatible with the documentLayout input property of model blocks.

¹ PDF defines a standard date format similar to the international standard Abstract Syntax Notation One (ASN.1), defined in ISO/IEC 8824. A date-time is a string with this format:
```
D:YYYYMMDDHHmmSSOHH'mm'
```
where
- YYYY is the year
- MM is the month
- DD is the day (01-31)
- HH is the hour (00-23)
- mm is the minute (00-59)
- SS is the second (00-59)
- O is the relationship of local time to Universal Time (UT), denoted by one of the characters +, -, or Z (see below)
- HH followed by ' is the absolute value of the offset from UT in hours (00-23)
- mm followed by ' is the absolute value of the offset from UT in minutes (00-59)
A plus sign (+) as the value of the O field signifies that local time is later than UT, a minus sign (-) that local time is earlier than UT, and the letter Z that local time is equal to UT. If no UT information is specified, the relationship of the specified time to UT is considered to be unknown. Whether or not the time zone is known, the rest of the date is specified in local time.
For example, December 23, 2022, at 7:52 PM, U.S. Pacific Standard Time, is represented by the string:
```
D:20221223195200-08'00'
```
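As a further example, March 27, 2022, at 7:52:30 PM, in a time zone five hours ahead of UT, is represented by the string: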
```
D:20220327195230+05'00'
```
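A date-time in this format can be converted to a Python datetime as in the sketch below. The function name is hypothetical, and the sketch only handles the complete form shown above, with an optional UT relationship. Strings that omit other fields are rejected.

```python
import re
from datetime import datetime, timedelta, timezone

# Full form only: D:YYYYMMDDHHmmSS, optionally followed by Z or by
# a sign and the offset written as HH'mm'.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:(Z)|([+-])(\d{2})'(\d{2})')?$"
)


def parse_pdf_date(value: str) -> datetime:
    """Parse a PDF date-time string; the result is naive if UT info is missing."""
    match = _PDF_DATE.match(value)
    if match is None:
        raise ValueError(f"unsupported PDF date: {value!r}")
    groups = match.groups()
    year, month, day, hour, minute, second = (int(g) for g in groups[:6])
    zulu, sign, off_hours, off_minutes = groups[6:]
    if zulu:
        tz = timezone.utc
    elif sign:
        offset = timedelta(hours=int(off_hours), minutes=int(off_minutes))
        tz = timezone(-offset if sign == "-" else offset)
    else:
        tz = None  # relationship to UT unknown
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


if __name__ == "__main__":
    print(parse_pdf_date("D:20221223195200-08'00'").isoformat())
```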
² Coordinates are expressed in pixels and refer to a 100 DPI (dots per inch) rendering of the page. The origin of the coordinates is the top left corner of the rendered page.
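Since the rendering resolution is 100 DPI, a coordinate in pixels divided by 100 gives inches, and multiplying inches by 72 gives typographic points. The Python sketch below applies this arithmetic to a bounding box. The function name is hypothetical, and the result stays relative to the top left corner of the page, so it is not a coordinate in the native PDF coordinate system.

```python
RENDER_DPI = 100      # resolution stated for the coordinates
POINTS_PER_INCH = 72  # typographic points per inch


def bbox_to_points(bbox: list[int]) -> list[float]:
    """Convert a bounding box from 100 DPI pixels to points (top left origin)."""
    return [value * POINTS_PER_INCH / RENDER_DPI for value in bbox]


if __name__ == "__main__":
    print(bbox_to_points([146, 207, 419, 228]))
```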