Skip to content

URL Converter processor

Description

The URL Converter processor gets plain text from a Web page.

Input

The processor requires the input JSON to contain this top level key:

"url": "url"

where:

  • url is the URL of the Web page to get text from.

At runtime NL Flow will use an Internet connection to download the page.

Block properties

Block properties can be set by editing the block.
URL Converter workflow blocks have the following properties:

  • Common:

    • The unique block ID and the service version, displayed in the title bar (read only, displayed also in the block tooltip in the canvas).
    • Block name: the block name, it can be edited.
    • Description: the description of the processor (read only).
  • Type Specific:

    • Timeout: execution timeout expressed in minutes (m) or seconds (s).
  • Deployment:

    • Replicas: number of required instances.
    • Memory: required memory.
    • CPU: thousandths of a CPU required (for example: 1000 = 1 CPUs).
  • Input

    Used for input mapping: one property for each of the top level keys of the input JSON.
    If:

    • The block is the first in a flow and the workflow input contains only the expected keys.

    Or:

    these properties do not need to be set.
    Otherwise, the properties determine which top level keys of the overall "upstream JSON" must be mapped to the block's input keys. The values of the properties must be set choosing from the compatible keys of upstream blocks' output or, if the input format of the workflow has been defined, from the keys of the $nlflow_input pseudo block.

Output and output-input mapping

The output of a URL Converter block is a JSON object with the following structure:

{
    content: extracted text,
    description: value of the description <meta> tag,
    domain: domain name,
    image: URL of a descriptive image found in the page
    keywords: value of the keywords <meta> tag,
    language: ISO 639-1 code of the page language,
    title: page title,
    url: page URL
}

For example:

{
    content:"Saturday, January 29, 2022 - The incumbent President of Italy Sergio Mattarella was re-elected for a second seven-year term yesterday in the eighth round of voting for a potential successor.
    Aged 80, Mattarella repeatedly expressed his desire to leave the position, including renting an apartment in Rome in anticipation of a move from the presidential Quirinal Palace (Quirinale). However, he relented after key figures, including Prime Minister Mario Draghi , urged him to stay on for the "stability" of the Republic. His first term was set to expire on February 3.
    Parliamentarians who went to Quirinale to ask him to remain quoted Mattarella as saying "I had other plans, but if needed, I am at your disposition". Seven rounds of fruitless voting to determine a successor involved an electoral college of 1009 "grand electors". They comprise 321 Senators , 630 Members of the Chamber of Deputies (MPs) and 58 regional delegates.",
    description:"",
    domain:"en.wikinews.org",
    image:"https://upload.wikimedia.org/wikipedia/commons/thumb/f/f3/Italy_%28orthographic_projection%29.svg/1200px-Italy_%28orthographic_projection%29.svg.png",
    keywords:"",
    language:"en",
    title:"Italian President Sergio Mattarella re-elected for second term, ending successor row",
    url:"https://en.wikinews.org/wiki/Italian_President_Sergio_Mattarella_re-elected_for_second_term,_ending_successor_row"
}

Typically, in the workflow, the URL Converter block is followed by a model block or more model blocks in parallel.
In these cases, inside the model block's configuration, the content property of the URL Converter block's output must be mapped to the text input property.