Extract converter

Description

The Extract Converter processor recognizes blocks of text in PDF documents and returns them together with supplemental information like the table of contents, the fonts, the number of pages, the author, the creation date and any PDF metadata.

The blocks of text are extracted page by page in the same order a human would read them.
When extracting text, Extract Converter can use Optical Character Recognition (OCR) to get plain text from images that represent text.

Use it as the first block of a workflow, as an alternative to the Tika Converter processor, when you are certain that only PDF documents will be analyzed. Unlike Extract Converter, Tika Converter has no OCR. In addition, especially for documents with complex layouts (multiple columns, text boxes, figures), Extract Converter returns the text in an order closer to the one in which a human would read it, which makes the text better suited to effective analysis by a model.

Note

Because of OCR (optional and disabled by default; it can be activated in the processor block configuration) and the other sophisticated analyses it carries out, Extract Converter may take much longer than Tika Converter to process a document.

Block properties

Extract Converter workflow blocks have the following properties:

  • Common block properties:

    • The unique block ID.
    • Block name: the name of the block (editable).
    • Description: the description of the processor (read only).
  • Type Specific tab:

    • Timeout: execution timeout expressed in minutes (m) or seconds (s).
  • Deployment tab:

    • Replicas: number of required instances (3 maximum)
    • Memory: required memory
    • CPU: thousandths of a CPU required (for example: 1000 = 1 CPU)
  • Functional tab:

    • Enable or disable table and title detection: recognition of tables and titles (enabled by default)
    • Enable or disable OCR extraction: Optical Character Recognition applied to the document's images (disabled by default)
    • Specify OCR language: when OCR is enabled, the language or script used to recognize text. Possible choices are:

      Value     Description
      eng       English
      chi_sim   Chinese (simplified)
      chi_tra   Chinese (traditional)
      hin       Hindi
      spa       Spanish
      fra       French
      ara       Arabic
      ben       Bengali
      rus       Russian
      por       Portuguese
      ind       Indonesian
      ita       Italian
      latn      Latin script (any language with Latin characters)

      For multi-language documents, concatenate language codes with a plus character (+), for example: eng+spa+ita.
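The concatenation rule above can be sketched in Python as follows. This is only an illustration: the function name `build_ocr_language` is hypothetical and not part of the product, and the set of codes is copied from the table above.

```python
# Codes copied from the table above; the names below are illustrative only.
SUPPORTED_OCR_CODES = {
    "eng", "chi_sim", "chi_tra", "hin", "spa", "fra", "ara",
    "ben", "rus", "por", "ind", "ita", "latn",
}


def build_ocr_language(codes):
    """Join OCR language codes with '+', rejecting codes not listed in the table."""
    if not codes:
        raise ValueError("at least one language code is required")
    unknown = [code for code in codes if code not in SUPPORTED_OCR_CODES]
    if unknown:
        raise ValueError(f"unsupported OCR language code(s): {unknown}")
    return "+".join(codes)


print(build_ocr_language(["eng", "spa", "ita"]))  # prints eng+spa+ita
```

The resulting string is what would be entered in the Specify OCR language property.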

Input

Read the NL Flow API manual for the description of the JSON object to submit to an Extract Converter block.

Output

Structure

The output of an Extract Converter block is a JSON object with the following structure:

{
    "result": {
        "header": {},
        "layout": [],
        "words": [],
        "tableOfContents": [],
        "fonts": []
    }
}

where:

  • result contains the extraction task results, as detailed below.
  • header contains general information about the document and the text extraction task.
  • layout contains the document's layout, i.e. the text organized in a hierarchical structure with pages and corresponding text blocks.
  • words contains the document's text in words, without layout information.
  • tableOfContents contains the document table of contents (TOC), if any.
  • fonts contains a list of the fonts used in the document.
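As a minimal sketch of how a client might inspect this structure, the following Python snippet checks which of the five properties are present and counts their items. The function name `summarize_output` is hypothetical, and the snippet assumes the output has already been obtained as a JSON string; how to obtain it is outside the scope of this example.

```python
import json

EXPECTED_KEYS = ("header", "layout", "words", "tableOfContents", "fonts")


def summarize_output(output):
    """Return a small summary of an Extract Converter output object."""
    result = output["result"]
    return {
        # Properties from the structure above that are absent in this output.
        "missing": [key for key in EXPECTED_KEYS if key not in result],
        "layout_elements": len(result.get("layout", [])),
        "word_pages": len(result.get("words", [])),
        "toc_entries": len(result.get("tableOfContents", [])),
        "fonts": len(result.get("fonts", [])),
    }


# Empty skeleton with the same shape as the structure shown above.
sample = json.loads(
    '{"result": {"header": {}, "layout": [], "words": [], '
    '"tableOfContents": [], "fonts": []}}'
)
print(summarize_output(sample))
```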

Output-input mapping

Typically, in the workflow, the Extract Converter block is followed by one model block, or by several model blocks in parallel.
In these cases, in the model block's configuration, it is enough to map the result property of the Extract Converter block's output to the documentLayout input property.

The model block "knows" how to get plain text out of the result property and there's no need to know more about the Extract Converter output. For any other need, see the detailed description of the output below.

header

The header object contains information about the whole document and the extraction task. Its contents are:

  • conversionDateTime: extraction task end date and time.
  • customInfo: PDF document properties:

    • Author: author
    • CreationDate: creation date and time (see note 1)
    • Creator: creator
    • ModDate: last modification date and time (see note 1)
    • Producer: generator application
    • Title: document title
  • documentName: document name.

  • errorPages: number of pages that could not be analyzed (present only in case of errors).
  • options: extraction task options, for troubleshooting only.
  • totPages: total number of pages.
  • version: software version for the Extract Converter processor.
  • metadata: an array of PDF metadata.
    Metadata is optional data that the PDF editor can insert into pages. This data is not displayed on the page, but is associated with visible elements.

    Each metadata item can have these properties:

    • bbox: array containing the coordinates (see note 2) of the metadata bounding box.

      • item 0: upper left corner X
      • item 1: upper left corner Y
      • item 2: lower right corner X
      • item 3: lower right corner Y
    • key: metadata key, its name.

    • page: number of the page where the metadata is located.
    • value: metadata value.

    For example:

    "metadata": [{
            "bbox": [146, 207, 419, 228],
            "key": "txtPolicyNumber",
            "page": 3,
            "value": "PACUIC001101-07 "
        }, {
            "bbox": [39, 426, 417, 357],
            "key": "txtNamedInsuredAndAddress",
            "page": 3,
            "value": "SWEET FRUIT ASSOCIATION INC.\r\n7100 APRICOT WAY\r\nST. PETERSBURG, FL  33706 "
        }, {
            "bbox": [421, 356, 829, 438],
            "key": "AgencyNameAndAddress",
            "page": 3,
            "value": "StaySafe Insurance Services, Inc.\r\n2502 N Rodeo Drive\r\nTampa, FL  33607 "
        }, {
            "bbox": [144, 254, 283, 275],
            "key": "txtEffectiveDate",
            "page": 3,
            "value": "4/27/2022 "
        }
    ]
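A metadata array like the one above is often easier to use when grouped by key. The sketch below is one possible approach, not part of the product: the function name `metadata_by_key` is hypothetical, trailing whitespace is stripped from values as a convenience, and each property is read defensively because each item "can have" the properties listed above.

```python
from collections import defaultdict


def metadata_by_key(header):
    """Group metadata items by key; the same key may occur more than once."""
    grouped = defaultdict(list)
    for item in header.get("metadata", []):
        key = item.get("key")
        if key is None:
            continue  # nothing to index this item by
        grouped[key].append({
            "page": item.get("page"),
            "value": item.get("value", "").strip(),
            "bbox": item.get("bbox"),
        })
    return dict(grouped)


# Two items taken from the example above.
header = {
    "metadata": [
        {"bbox": [146, 207, 419, 228], "key": "txtPolicyNumber", "page": 3,
         "value": "PACUIC001101-07 "},
        {"bbox": [144, 254, 283, 275], "key": "txtEffectiveDate", "page": 3,
         "value": "4/27/2022 "},
    ]
}
fields = metadata_by_key(header)
print(fields["txtPolicyNumber"][0]["value"])  # prints PACUIC001101-07
```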
    

layout

layout is an array containing all the layout elements recognized in the document.
The order of the elements inside the array reflects the sequence of pages, so all the elements of page 1 are found first, then those of page 2, and so on.
Within the elements of a page, the first element represents the page itself and the other elements are blocks of text, tables or table cells. The position of text blocks and tables in the array corresponds to what Extract Converter assumed to be the order in which a human would read them on the page.

Elements can represent:

  • Pages (only the bounding box)
  • Titles
  • Headers
  • Footers
  • Body-level text blocks
  • Tables (only the bounding box)
  • Table cells

The properties that each element can have are:

Property  Pages  Titles  Headers & footers  Body-level text blocks  Tables  Table cells
id        X      X       X                  X                       X       X
type      X      X       X                  X                       X       X
page      X      X       X                  X                       X       X
children  X      -       -                  -                       X       -
parent    -      X       X                  X                       X       X
content   -      X       X                  X                       -       X
bbox      X      X       X                  X                       X       X
label     -      X       -                  -                       -       -
row       -      -       -                  -                       -       X
column    -      -       -                  -                       -       X
isHead    -      -       -                  -                       -       X
span      -      -       -                  -                       -       X
Properties are:

  • id: element ID, every element has a unique value for this property.
  • type: element type. Possible values are:

    Element type            type value  Description
    Pages                   page        The element is the "container" (it has no text of its own) of all the textual elements displayed on a page.
    Body-level text blocks  text        The element is a block of text (e.g. a paragraph, a text box) at the body level, i.e. not a title.
    Titles                  title       The element is a heading.
    Tables                  table       The element is the "container" (it has no text of its own) of all the elements (cells) of a table.
    Table cells             cell        The element is a table's cell.
    Headers                 header      The element is a page header.
    Footers                 footer      The element is a page footer.
  • page: page number

  • children: list of child blocks' IDs, only in page and table elements. This property is an array, each item of which is the value of the id property of an element that is hierarchically a child of this element. For example, the titles in a page are children of the page element, the cells of a table are children of a table element.
  • parent: parent element ID. In case of table cells, the value of this property is the value of the id property of the table element, while for title, text, header & footer and table elements, it is the value of the id property of a page element. Page elements don't have this property because their "parent" is the document itself.
  • label: for titles, it specifies the title level as in tableOfContents.
  • content: text of the element, this property is absent in page and table elements, which are "containers".
  • bbox: array containing the coordinates (see note 2) of the element's bounding box.

    • item 0: upper left corner X
    • item 1: upper left corner Y
    • item 2: lower right corner X
    • item 3: lower right corner Y
  • row: cell row number.

  • column: cell column number.
  • isHead: set to true if the cell is a column header.
  • span: cell span. It's an array of integer numbers. When present, the cell spans more than one row and/or column. The first item of the array is the row span, the second is the column span.
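The properties described above are enough to rebuild a page's text in reading order and to arrange a table's cells into rows. The following Python sketch shows one way to do both. The function names and the sample layout are hypothetical (the sample omits bbox for brevity). The sketch does not assume whether row and column numbering starts at 0 or 1, and it ignores span.

```python
def page_text(layout, page):
    """Join title and body-level text of one page, in array (reading) order."""
    parts = [
        element["content"]
        for element in layout
        if element.get("page") == page
        and element.get("type") in ("title", "text")
        and element.get("content")
    ]
    return "\n".join(parts)


def table_to_grid(layout, table_id):
    """Arrange the cells of one table element into a list of rows."""
    by_id = {element["id"]: element for element in layout}
    children = by_id[table_id].get("children", [])
    cells = [by_id[cid] for cid in children if by_id[cid].get("type") == "cell"]
    # Sort the distinct row/column values and map them to grid positions,
    # so the result does not depend on the numbering base. Spans are ignored.
    rows = sorted({cell["row"] for cell in cells})
    columns = sorted({cell["column"] for cell in cells})
    grid = [["" for _ in columns] for _ in rows]
    for cell in cells:
        r, c = rows.index(cell["row"]), columns.index(cell["column"])
        grid[r][c] = cell.get("content", "")
    return grid


# Hypothetical, trimmed-down layout used only to exercise the functions.
layout = [
    {"id": 1, "type": "page", "page": 1, "children": [2, 3, 4]},
    {"id": 2, "type": "title", "page": 1, "parent": 1, "label": 1,
     "content": "Schedule"},
    {"id": 3, "type": "text", "page": 1, "parent": 1,
     "content": "Prices per unit."},
    {"id": 4, "type": "table", "page": 1, "parent": 1, "children": [5, 6, 7, 8]},
    {"id": 5, "type": "cell", "page": 1, "parent": 4, "row": 0, "column": 0,
     "isHead": True, "content": "Item"},
    {"id": 6, "type": "cell", "page": 1, "parent": 4, "row": 0, "column": 1,
     "isHead": True, "content": "Price"},
    {"id": 7, "type": "cell", "page": 1, "parent": 4, "row": 1, "column": 0,
     "content": "Apple"},
    {"id": 8, "type": "cell", "page": 1, "parent": 4, "row": 1, "column": 1,
     "content": "1.20"},
]
print(page_text(layout, 1))
print(table_to_grid(layout, 4))
```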

words

The words array contains one item per page and each item represents, in an encoded and compressed form, all the words present on the page.

The value of the single item is encoded in Base64.
The decoded value is a byte array in gzip format. The expanded byte array value is another byte array in which each word corresponds to a variable-length sequence of bytes with this structure:

[UTF-8 encoded text][0x00][Parent element ID][Font ID][Bounding box coordinates]

  • UTF-8 encoded text is the text of the word.
  • 0x00 is a single zero byte that marks the end of the text.
  • Parent element ID is four bytes long and must be interpreted as a little-endian integer. The value is the ID (the value of the id property) of the layout element in which the word is located.
  • Font ID is four bytes long and must be interpreted as a little-endian integer. The value is the ID of the font with which the word is written in the document, so it coincides with the value of the id property of the item of the fonts array that represents the font.
  • Bounding box coordinates is 16 bytes long and consists of four parts of four bytes each. Each part must be interpreted as a little-endian integer. The parts are the coordinates (see note 2) of the word bounding box and, taken from left to right, have this meaning:

    1. upper left corner X
    2. upper left corner Y
    3. lower right corner X
    4. lower right corner Y
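The decoding steps described above (Base64, then gzip, then the per-word byte records) can be sketched in Python with the standard library only. The function names are hypothetical. The text does not say whether the integers are signed, so signed 32-bit values are assumed here. `encode_words` is not part of the format description: it is the inverse operation, included only to build a test payload.

```python
import base64
import gzip
import struct

# Parent element ID, font ID and four bounding box coordinates:
# six little-endian 32-bit integers (assumed signed), 24 bytes in total.
_TAIL = struct.Struct("<6i")


def decode_words(item):
    """Decode one item of the words array into a list of word dictionaries."""
    data = gzip.decompress(base64.b64decode(item))
    words = []
    pos = 0
    while pos < len(data):
        end = data.index(b"\x00", pos)  # the zero byte terminates the text
        text = data[pos:end].decode("utf-8")
        parent_id, font_id, x0, y0, x1, y1 = _TAIL.unpack_from(data, end + 1)
        words.append({"text": text, "parent": parent_id, "font": font_id,
                      "bbox": [x0, y0, x1, y1]})
        pos = end + 1 + _TAIL.size
    return words


def encode_words(words):
    """Inverse of decode_words, used here only to build a test payload."""
    raw = b"".join(
        word["text"].encode("utf-8") + b"\x00"
        + _TAIL.pack(word["parent"], word["font"], *word["bbox"])
        for word in words
    )
    return base64.b64encode(gzip.compress(raw)).decode("ascii")


# Hypothetical words, round-tripped through the encoder and the decoder.
sample = [
    {"text": "UMBRELLA", "parent": 2, "font": 1, "bbox": [100, 80, 190, 96]},
    {"text": "città", "parent": 2, "font": 1, "bbox": [196, 80, 240, 96]},
]
assert decode_words(encode_words(sample)) == sample
```

Each decoded word can then be joined to its layout element through parent and to its font through font.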

tableOfContents

This array represents the table of contents (TOC) of the document, if any, obtained both from explicit information of the PDF document and from the visual examination of the pages.
Each item in the array represents an entry in the TOC and has these properties:

Name      Description
score     Item recognition confidence score
level     Title level in the title hierarchy. The value of this property coincides with the value of the label property of the corresponding title element in layout (see the layoutId property below)
source    Only for troubleshooting
layoutId  Cross-reference to the layout element. The value of this property coincides with the value of the id property of the corresponding title element in layout
content   TOC item text

For example:

{
    "score": 0.8755,
    "level": 1,
    "source": "d",
    "layoutId": 2,
    "content": "UMBRELLA LIABILITY POLICY SCHEDULE"
}
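As an illustration, TOC items like the one above can be rendered as an indented outline. In this sketch the function name `toc_outline` is hypothetical, level 1 is assumed to be the top level (as in the example), and a higher score is assumed to mean higher confidence. The second sample entry is invented for the demonstration.

```python
def toc_outline(toc, min_score=0.0):
    """Return the TOC as indented lines, skipping entries below min_score."""
    lines = []
    for entry in toc:
        if entry.get("score", 1.0) < min_score:
            continue
        indent = "  " * max(int(entry["level"]) - 1, 0)
        lines.append(
            f"{indent}{entry['content']} (layout element {entry['layoutId']})"
        )
    return lines


toc = [
    {"score": 0.8755, "level": 1, "source": "d", "layoutId": 2,
     "content": "UMBRELLA LIABILITY POLICY SCHEDULE"},
    # Hypothetical second-level entry, not taken from a real document.
    {"score": 0.41, "level": 2, "source": "d", "layoutId": 9,
     "content": "Limits of insurance"},
]
for line in toc_outline(toc):
    print(line)
```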

fonts

This array lists the fonts used in the document's text. Each item in the array represents a font and has these properties:

Key      Description
id       Unique ID
id_name  Original font name
name     Normalized font name
bold     true if bold, false otherwise
italic   true if italic, false otherwise
ocr      true if recognized through OCR, false otherwise

For example:

{
    "name": "Arial",
    "bold": false,
    "id": 1,
    "italic": false,
    "id_name": "Arial",
    "ocr": false
}
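Because words and fonts are linked through the font id, a lookup table keyed by id is a natural helper. The sketch below is a hypothetical example of such a helper and of a short human-readable font description. The second font in the sample is invented for the demonstration.

```python
def fonts_by_id(fonts):
    """Index the fonts array by the id property."""
    return {font["id"]: font for font in fonts}


def describe_font(font):
    """Return the normalized name followed by the style flags that are true."""
    flags = [
        label
        for label, key in (("bold", "bold"), ("italic", "italic"), ("OCR", "ocr"))
        if font.get(key)
    ]
    return font["name"] + (f" ({', '.join(flags)})" if flags else "")


fonts = [
    {"name": "Arial", "bold": False, "id": 1, "italic": False,
     "id_name": "Arial", "ocr": False},
    # Hypothetical font, not taken from a real document.
    {"name": "Helvetica", "bold": True, "id": 2, "italic": False,
     "id_name": "Helvetica-Bold", "ocr": True},
]
index = fonts_by_id(fonts)
print(describe_font(index[1]))  # prints Arial
print(describe_font(index[2]))  # prints Helvetica (bold, OCR)
```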

  1. PDF defines a standard date format similar to the international standard Abstract Syntax Notation One (ASN.1), defined in ISO/IEC 8824. A date-time is a string with this format:

    D:YYYYMMDDHHmmSSOHH'mm'
    

    where

    • YYYY is the year
    • MM is the month
    • DD is the day (01-31)
    • HH is the hour (00-23)
    • mm is the minute (00-59)
    • SS is the second (00-59)
    • O is the relationship of local time to Universal Time (UT), denoted by one of the characters +, -, or Z (see below)
    • HH followed by ' is the absolute value of the offset from UT in hours (00-23)
    • mm followed by ' is the absolute value of the offset from UT in minutes (00-59)

    A plus sign (+) as the value of the O field signifies that local time is later than UT, a minus sign (-) that local time is earlier than UT, and the letter Z that local time is equal to UT. If no UT information is specified, the relationship of the specified time to UT is considered to be unknown. Whether or not the time zone is known, the rest of the date is specified in local time.
    For example, December 23, 2022, at 7:52 PM, U.S. Pacific Standard Time, is represented by the string:

    D:20221223195200-08'00'
    

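    Similarly, March 27, 2022, at 7:52:30 PM, in a time zone five hours ahead of UT, is represented by the string: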

    D:20220327195230+05'00'
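A date-time string in the format described in note 1 can be converted to a Python datetime as sketched below. The function name `parse_pdf_date` is hypothetical. The sketch handles only the full form shown in the note, with the time zone part optional; shortened variants that some PDF producers write are not handled.

```python
import re
from datetime import datetime, timedelta, timezone

_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:(Z)|([+-])(\d{2})'(\d{2})'?)?$"
)


def parse_pdf_date(value):
    """Parse a D:YYYYMMDDHHmmSSOHH'mm' string into a datetime."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        raise ValueError(f"not a full PDF date string: {value!r}")
    groups = match.groups()
    year, month, day, hour, minute, second = (int(g) for g in groups[:6])
    zulu, sign, off_hours, off_minutes = groups[6:]
    if zulu:
        tz = timezone.utc
    elif sign:
        offset = timedelta(hours=int(off_hours), minutes=int(off_minutes))
        tz = timezone(offset if sign == "+" else -offset)
    else:
        tz = None  # relationship to UT unknown: return a naive datetime
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


print(parse_pdf_date("D:20221223195200-08'00'").isoformat())
# prints 2022-12-23T19:52:00-08:00
```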
    

  2. Coordinates are in pixels and refer to a 100 DPI (dots per inch) rendering of the page. The origin of the coordinates is the top left corner of the rendered page.
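Given the 100 DPI reference in note 2, a bounding box can be converted to inches or to points (1/72 inch, the default unit of PDF user space) with simple arithmetic. The function names below are hypothetical. The conversion keeps the top left origin and does not flip the Y axis to the bottom left origin PDF uses by default, because that would require the page height.

```python
PIXELS_PER_INCH = 100  # rendering resolution stated in note 2
POINTS_PER_INCH = 72   # default PDF user space unit is 1/72 inch


def bbox_to_points(bbox):
    """Convert [x0, y0, x1, y1] from 100 DPI pixels to points (top left origin)."""
    return [value * POINTS_PER_INCH / PIXELS_PER_INCH for value in bbox]


def bbox_size_inches(bbox):
    """Return (width, height) of a bounding box in inches."""
    x0, y0, x1, y1 = bbox
    return ((x1 - x0) / PIXELS_PER_INCH, (y1 - y0) / PIXELS_PER_INCH)


# Bounding box taken from the first metadata item in the example above.
bbox = [146, 207, 419, 228]
print(bbox_to_points(bbox))
print(bbox_size_inches(bbox))
```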