
Workflow inner workings

Introduction

A published workflow can be either synchronous or asynchronous.

Every workflow can be asynchronous, but not every workflow can be synchronous. In fact, if a workflow has blocks of one of these components:

it cannot be synchronous.

Note

It is possible that a runtime is configured by its administrator to allow only one of the two publishing modes.

Synchronous and asynchronous workflows are used through different APIs and have distinct inner workings, as described below.

Synchronous workflows

API behavior

A synchronous workflow responds to API requests in a synchronous way: the client software makes an API call that sends the request, hangs while the workflow produces the response and unblocks when the response has been fully received. One simple API endpoint is all that is needed to use a synchronous workflow.
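The interaction reduces to a single blocking HTTP POST. In this sketch the endpoint path and payload are hypothetical, and the HTTP opener is injectable so the function can be exercised without a network:

```python
import json
from urllib import request

def call_sync_workflow(base_url: str, payload: dict, opener=request.urlopen) -> dict:
    """Blocking call to a hypothetical synchronous-workflow endpoint.

    The client hangs inside opener() until the full response has been
    received; `opener` is injectable so the sketch needs no network.
    """
    req = request.Request(
        f"{base_url}/execute",  # hypothetical endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with opener(req) as resp:
        return json.loads(resp.read())
```

A real client would simply call this once per request and wait for the result.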

A possible comparison is with buying a hot dog from a street vendor: the customer shows up, orders a hot dog and waits until it is ready.

Synchronous mode is suitable for stream processing with structurally simple workflows.

Cluster services

If a runtime is set up to host synchronous workflows, its Kubernetes cluster contains:

  • A pod for the API gateway
  • A pod for the workflow orchestration service, shared between all workflows, which also takes care of executing JavaScript Interpreter v. 1.0.0 blocks and operator blocks.
  • For each published synchronous workflow, if there are model, processor or custom component blocks, a pod for each replica of each of those blocks.

If a workflow consists only of JavaScript Interpreter v. 1.0.0 blocks and operator blocks, there are no workflow-specific pods and everything is handled by the orchestration service, of which there is one per runtime, shared between all workflows.

How they work

When a new request for a workflow execution arrives, the API gateway puts it in the work queue of the orchestration service. The orchestration service picks the request and starts executing the workflow. It directly takes care of executing blocks of these components:

When it encounters a model or processor block in the flow, it prepares the JSON input for it, following the rules of the input mapping or taking the implicit input, and then invokes the service inside the block. If the block has multiple replicas, the cluster management software chooses one randomly.
The orchestration service receives a response from the block and continues processing the flow with the next block or, at the end of the flow, it returns the overall workflow output to the API gateway.
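A minimal sketch of this orchestration loop, with blocks modeled as plain callables and each input mapping as a function (names and shapes are illustrative, not the actual service API):

```python
from typing import Callable

Block = Callable[[dict], dict]
Mapping = Callable[[dict], dict]

def orchestrate(flow: list[tuple[Mapping, Block]], payload: dict) -> dict:
    """Run a linear flow: for each step, build the block's JSON input
    (input mapping rules, or identity for the implicit input), invoke
    the service inside the block, and feed the response onward."""
    data = payload
    for input_mapping, block in flow:
        block_input = input_mapping(data)  # prepare the JSON input
        data = block(block_input)          # invoke the block's service
    return data                            # overall workflow output
```

With two toy blocks and identity mappings, `orchestrate` chains their responses exactly as the description above: each block's output becomes the next block's input.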

Parallelism

The fact that the workflow is synchronous does not mean that it cannot handle multiple requests at the same time. Synchronous describes only the way a client interacts with the workflow; internally, in a multi-block workflow, one block can be engaged with one request while another block handles another.

The comparison is with a car assembly line: the various stations of the assembly line can each work on a different car. Each block of a workflow is comparable to an assembly station, so a workflow made of at least two blocks can handle multiple requests at the same time: while block B is dealing with request 1, block A can already be dealing with request 2.

Further parallelism can be obtained:

  • By creating independent flows: a workflow can contain multiple flows and the only synchronization points are at the beginning and end of the flows.
  • By creating parallel branches with Fork and Join operators, either explicit or implicit.

Replicas

A block cannot handle multiple requests at the same time unless it has more than one replica.
Having two or more replicas of a block is like having replicas of an assembly station.
Let's imagine a car assembly line with three stations. Every 10 minutes the parts required to assemble a new car are brought in front of station 1. Station 1 requires 10 minutes to do its part of the assembly, station 2 takes 20 minutes and station 3 takes 10. Station 2 is therefore the bottleneck. Let's see what would happen in practice.

  • Minute 0: the parts to assemble the first car are brought to station 1.
  • Minutes 0-10: first car, station 1 at work.
  • Minute 10: station 1 passes the first car to station 2. Parts for the second car arrive at station 1.
  • Minutes 10-20: station 1 does the second car. Station 2 does half of its job on the first car.
  • Minute 20: station 1 cannot pass the second car to station 2 which is still busy, the car remains blocked at station 1. Parts for the third car arrive in front of station 1.
  • Minutes 20-30: second car blocked at station 1. Station 2 finishes its job on the first car.
  • Minute 30: station 2 passes the first car to station 3, station 1 passes the second car to station 2 and takes in the parts for the third car. Parts for the fourth car arrive in front of station 1.
  • Minutes 30-40: station 3 does the first car, station 2 does half the job on the second car, station 1 at work on the third car.
  • Minute 40: station 3 delivers the first car, but station 2 cannot pass the second car to it because only half of the job has been done. Station 1 is stuck, occupied by the third car. Parts for the fifth car arrive in front of station 1.
  • Minutes 40-50: station 3 is empty while station 2 finishes the second car. Station 1 is idle, occupied by the third car. Parts for two cars wait in front of station 1.
  • Minute 50: station 2 passes the second car to station 3, station 1 passes the third car to station 2 and takes in the parts to build the fourth car. Parts for the sixth car arrive in front of station 1.
  • Minutes 50-60: all stations busy. Parts for three cars lay in front of station 1.
  • Minute 60: station 3 delivers the second car. Station 2 remains busy with the third car. Station 1 cannot pass the fourth car to station 2. Parts for the seventh car arrive in front of station 1.

It can be noticed that:

  1. Parts for new cars accumulate before station 1.
  2. Station 1 is blocked half of the time, occupied by a car it has already finished.
  3. Station 3 is empty and inactive half of the time.
  4. After the start-up phase, the assembly line delivers a complete car every 20 minutes, the time required by the slowest station.

A similar workflow might have the first block take 100 milliseconds on average, the second 200 milliseconds, and the third 100 milliseconds. If a request arrives every 100 milliseconds, the workflow produces a result every 200 milliseconds.
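The 200-millisecond steady-state interval can be reproduced with a small simulation; the same model also shows what happens when the slow stage gets a second replica. This is a sketch that assumes unbounded buffers between stages and ignores transfer times:

```python
import heapq

def completion_times(durations, replicas, arrivals):
    """Finish time of each item at the last stage of a pipeline.

    durations[k]: service time of stage k; replicas[k]: parallel copies
    of stage k; arrivals[i]: when item i's input becomes available.
    """
    # each stage keeps a min-heap of the times at which its replicas free up
    free = [[0.0] * r for r in replicas]
    for h in free:
        heapq.heapify(h)
    done = []
    for t_arrive in arrivals:
        ready = t_arrive
        for k, d in enumerate(durations):
            start = max(ready, heapq.heappop(free[k]))  # earliest idle replica
            ready = start + d
            heapq.heappush(free[k], ready)
        done.append(ready)
    return done

arrivals = [100 * i for i in range(12)]            # one request every 100 ms
single = completion_times([100, 200, 100], [1, 1, 1], arrivals)
doubled = completion_times([100, 200, 100], [1, 2, 1], arrivals)
# steady state: one result every 200 ms vs one every 100 ms
```

Running it confirms the text: with one replica per stage, results come out 200 ms apart; with two replicas of the slow stage, 100 ms apart.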

An optimal assembly line has stations that work more or less at the same speed, dividing the work into phases of similar duration. If this is not possible, the slower stations can be replicated.
Let's see what happens by having two replicas of station 2.

  • Minute 0: the parts to assemble the first car are brought to station 1.
  • Minutes 0-10: first car, station 1 at work.
  • Minute 10: station 1 passes the first car to one of the two replicas of station 2. Parts for the second car arrive at station 1.
  • Minutes 10-20: station 1 does the second car. One replica of station 2 does half of its job on the first car.
  • Minute 20: station 1 passes the second car to the other replica of station 2. Parts for the third car arrive in front of station 1.
  • Minutes 20-30: station 1 at work on the third car, both replicas of station 2 busy.
  • Minute 30: one replica of station 2 passes the first car to station 3, station 1 passes the third car to the replica of station 2 that has been freed. Parts for the fourth car arrive in front of station 1 and station 1 takes them in.
  • Minutes 30-40: all stations busy.
  • Minute 40: station 3 delivers the first car, one replica of station 2 passes the second car to it and receives the fourth car from station 1. Parts for the fifth car arrive in front of station 1.
  • Minutes 40-50: all stations busy.
  • Minute 50: station 3 delivers the second car, one replica of station 2 passes the third car to it and receives the fifth car from station 1. Parts for the sixth car arrive in front of station 1.
  • Minutes 50-60: all stations busy.
  • Minute 60: station 3 delivers the third car, one replica of station 2 passes the fourth car to it and receives the sixth car from station 1. Parts for the seventh car arrive in front of station 1.

The overall effect of having two replicas of the slowest station:

  1. Parts do not accumulate before the line.
  2. All assembly stations are always active: there are no blockages and no inactivity.
  3. After the start-up phase the assembly line delivers a complete car every 10 minutes.

Please note that the overall time required to assemble a car doesn't change with replicas: it remains 40 minutes. What increases is the productivity of the assembly line, which goes from 3 cars per hour to 6 cars per hour.

The same goes for the analogous workflow. Having more than one replica increases the workflow's ability to handle more requests at the same time, because while one replica of a block is busy with a request, another replica of the same block can work on another.
Adding replicas is useful, in case of multiple simultaneous clients, even for a workflow made of a single block: instead of all queuing on the single replica, concurrent requests are distributed among all replicas, increasing the throughput, a bit like having two identical workflows behind a load balancer.

When determining the number of replicas, as for the assembly line, give more replicas to the slower blocks. To determine the speed of individual blocks, the latency information provided by debug mode in interactive tests is useful.
For example, if block A takes on average twice as long as block B, it makes sense to give it twice as many replicas.
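This rule of thumb can be sketched as a small helper. Normalizing against the fastest block so that it gets exactly one replica is an assumption for illustration, not a product rule:

```python
import math

def suggest_replicas(avg_latency_ms: dict) -> dict:
    """Give each block a replica count proportional to its average
    latency, with the fastest block getting one replica."""
    fastest = min(avg_latency_ms.values())
    return {name: math.ceil(ms / fastest) for name, ms in avg_latency_ms.items()}
```

The average latencies would come from debug-mode interactive tests, and the result must still respect the runtime's maximum-replica settings.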

Performance ceiling

Performance does not improve indefinitely by adding replicas: the orchestration service is multi-threaded but not replicated, so beyond a certain number of replicas performance stops improving, and adding even more actually makes it worse.
As a rough guide, with a factory configuration, cumulative multi-client throughput peaks at 10 concurrent clients and 10 replicas for the slowest block, but throughput can be increased considerably by giving more computational resources to the orchestration service.

Also, the orchestration service only decides which cluster service (corresponding to the block) to invoke; with multiple replicas it is the cluster management service that picks the replica, and it does so blindly, at random, with no knowledge of each replica's activity. It can therefore pick a replica that is still busy with a previous request instead of one that is idle.

Error management

Synchronous workflows are intolerant to errors. Any error in any block interrupts the execution and returns an error. The client software gets an HTTP error status code like 500 (Internal Server Error) and the response body contains an explanatory error message.

Timeouts

There is no limit to the duration of a call to a synchronous workflow, but individual blocks may have a timeout deployment property that determines the maximum duration of their processing. The time is counted from when the orchestration service invokes the service inside the block.
Some blocks may accept new requests from the orchestration service while still processing a previous one, so one or more requests may reach the timeout while waiting inside the block, without even entering the processing phase.
If a block takes longer than its timeout to process a request, it raises an error condition (see above).

Cost

Keeping a workflow running, processing or ready to process, requires CPU and RAM, that is, computers, which have a cost and consume energy.

If the NL Flow runtime cluster is on physical computers, typically the number of computers is established a priori and the computers are always on, whether the workflows are published or not. This setup has costs that we can define as fixed: the purchase or rental of the hardware, maintenance, the cost of energy and the network.

If instead, as is more frequent, the cluster is provided by Cloud services such as AWS EKS or Azure AKS, it can benefit from the Kubernetes autoscaler: new nodes—that is virtual computers—are procured on demand, turned on and added to the cluster when needed, then terminated when empty and no longer needed.

Publishing a synchronous workflow means requesting, immediately and for as long as the workflow remains published, the maximum number of replicas for each block and the allocation of all the CPU and RAM required by each replica. These settings are decided by whoever designs the workflow.
This can lead to adding nodes to the cluster, if the existing nodes do not have sufficient resources.

The cost of the workflow can be calculated in required computers by dividing the computational footprint of the workflow by the capacity of a computer, which is chosen when installing the runtime.
For example, suppose a workflow has a computational footprint (visible in the editor, on the left in the toolbar, and in the list of workflows) of 18000 thousandths of CPU and 40 GiB of RAM, and computers with 8 CPUs and 32 GiB of RAM are used. Considering that a node has slightly fewer resources available for workflows than its nominal capacity, because of the system software (operating system, Kubernetes), three nodes are needed, one of which will remain partially empty or can be shared with other workflows.
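A minimal sketch of this node calculation, assuming that about 90% of a node's capacity is allocatable to workflows (the exact fraction depends on the installation):

```python
import math

def nodes_needed(cpu_millis, ram_gib, node_cpus, node_ram_gib, allocatable=0.9):
    """Nodes required for a workflow footprint, assuming ~90% of each
    node's capacity is usable (the rest goes to OS and Kubernetes)."""
    by_cpu = cpu_millis / (node_cpus * 1000 * allocatable)
    by_ram = ram_gib / (node_ram_gib * allocatable)
    return math.ceil(max(by_cpu, by_ram))
```

With the example figures (18000 mCPU, 40 GiB, 8-CPU/32-GiB nodes) this yields the three nodes mentioned above.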

A published synchronous workflow costs the same whether it is used or not, and this can be a waste of money and energy.
When a workflow is unpublished, all replicas of all blocks are deleted, freeing up computing resources. Following this, and possible pod redistribution, one or more computers can become empty; the Kubernetes autoscaler then terminates them, bringing their cost to zero.
For this reason, if it is known that a workflow is not used on weekends, at night or in other time slots, it makes sense to agree with the installation administrator on a schedule of automatic unpublishing and republishing, or even shutdown and restart of the entire runtime cluster.

Asynchronous workflows

API Behavior

In classic mode a client starts using an asynchronous workflow by sending the payload to the task creation endpoint, which immediately returns a task ID.
The workflow then queues the task and executes it once computing resources have been freed from previous tasks. At the end of the execution, the outcome of the task is stored.
The client software periodically inquires about the task status, which can be:

  • SUBMITTED immediately after the creation
  • RUNNING as soon as the task gets queued and for as long as it's being executed
  • COMPLETED or ERROR when the task is finished and the outcome is ready

When the task is finished, the client software can make a request to get the outcome and normally makes another request to delete any trace of the task to free resources. Undeleted tasks are automatically deleted by the NL Flow runtime after 24 hours.
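The classic-mode interaction can be sketched as a polling loop. Endpoint operations are passed in as plain callables; a real client would wrap HTTP calls to the task endpoints, whose paths depend on the runtime and are not shown here:

```python
import time

def run_task(create, get_status, get_result, delete, payload, interval=1.0):
    """Classic asynchronous interaction: create a task, poll its status
    until COMPLETED or ERROR, fetch the outcome, then delete the task."""
    task_id = create(payload)                    # returns a task ID immediately
    while get_status(task_id) not in ("COMPLETED", "ERROR"):
        time.sleep(interval)                     # SUBMITTED, then RUNNING
    outcome = get_result(task_id)
    delete(task_id)                              # free server-side resources
    return outcome
```

The polling interval is a trade-off between responsiveness and load on the task management service.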

A possible comparison is that of an online purchase with delivery to a locker: the customer buys the goods and quickly receives an order confirmation. Every now and then they check for updates and, when the goods are delivered, collect them from the locker.

Asynchronous workflows can also be used synchronously through the all-in-one mode. Behind the scenes nothing changes: a task is created, queued and executed, then the results are stored. However, the client makes just one call, at the end of which it gets the outcome of the invisible task. The server automatically deletes all traces of the task after the call.

Classic mode is suggested for workflows with relatively high execution times, for example for the processing of ZIP archives containing multiple documents or e-mail messages that can have large and nested attachments. All-in-one mode, on the other hand, is good for fast workflows or when classic mode is impractical.

Cluster services

If a runtime is set up to host asynchronous workflows, its Kubernetes cluster contains:

  • A pod for the API gateway
  • A pod for the task management service, shared between all workflows
  • A pod for the workflow orchestration service, shared between all workflows
  • For each published asynchronous workflow:

    If there are blocks of:

    a pod for each replica of each of these blocks.
    Then, for shared services:

    • A pod for each replica of Output Producer.
    • A pod for each replica of Script Interpreter, if there are blocks of JavaScript Interpreter 2.0.0, Python Interpreter or Map and Switch operator.
    • A pod for each replica of Proxy Legacy, if there are blocks developed with legacy technology including JavaScript Interpreter v. 1.0.0.

The number of replicas active at any given time for blocks and services is determined by the maximum number of replicas set by the designer, the possible activation of block autoscaling and, in that case, the instantaneous load and the autoscaling parameters.

How they work

Requests to execute a workflow are submitted via the API gateway to the task management service.
For each request received, the management service creates a new task, returns the task ID to the requester, stores the JSON input in storage, writes a message to the queue of the workflow orchestration service and puts the task in the SUBMITTED status.

The workflow orchestration service reads the message from the task management service, writes a message to the queue of all the first blocks of the workflow flows and writes a message to the queue of the task management service to signal the fact that the workflow has started. In reaction to this, the task management service puts the task in the RUNNING status.

During the flow processing, the orchestration service, in addition to its main job of orchestrating the overall execution, directly takes care of the execution of the operator blocks, excluding Map and Switch that are executed by the shared script interpreter service as they are based on JavaScript.

Block and service replicas constantly poll their message queues.
As written above for the first blocks of the flows, to make a replica of a block work, the orchestration service writes a message to the queue of the block, that is a queue that all replicas access concurrently. The first replica that arrives takes the message, gets the input from storage, following the rules of the input mapping or taking the implicit input.
The replica does its job then writes the JSON output to storage and writes a message to the orchestration service queue to signal that it is done.
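The self-service mechanism, in which all replicas of a block compete for messages on the same queue, can be sketched with worker threads sharing an in-process queue (a simplification: the real system uses a message broker and separate pods):

```python
import queue
import threading

def run_replicas(messages, n_replicas, work):
    """All replicas poll the same queue; each message is taken by
    exactly one replica, in no guaranteed order."""
    q = queue.Queue()
    for m in messages:
        q.put(m)
    results = []
    lock = threading.Lock()

    def replica():
        while True:
            try:
                msg = q.get_nowait()  # first replica to arrive takes the message
            except queue.Empty:
                return                # queue drained, replica goes idle
            out = work(msg)           # the replica does its job
            with lock:
                results.append(out)   # stands in for writing output to storage

    threads = [threading.Thread(target=replica) for _ in range(n_replicas)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Every message is processed exactly once, regardless of how many replicas compete for the queue, which is why adding replicas raises throughput without duplicating work.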
The orchestration service reads the messages in its queue and proceeds with the flow as above, possibly taking care of:

  • Sending messages to multiple downstream blocks when the flow branches (downstream of a Fork, a Join-Fork or possibly also a Switch when multiple conditions are satisfied).
  • Sending messages to the same block iteratively when the flow enters or is in a context generated by a splitter or remapper.
  • Waiting for all upstream blocks to finish before proceeding, at convergence points (Join, Join-Fork, End Context and End Switch).

At the end of the flow, or in case of errors exceeding the maximum number allowed, the orchestration service writes a message to the queue of the workflow output production service, which stores the output in storage and tells the task management service to set the task status to COMPLETED or ERROR.

The task management service:

  1. Responds to requests on the status of tasks.
  2. Responds to requests for the responses of tasks.
  3. Can provide the initial request with which a task was created.
  4. Handles requests to delete tasks.
  5. Puts RUNNING tasks in the DROPPED status when task duration exceeds the workflow timeout or when the workflow is unpublished.

Parallelism

Like a synchronous workflow (see above), even in an asynchronous workflow the various blocks and services can be engaged at the same time on multiple different tasks, like an assembly line that works on multiple cars at the same time.
The difference here is that while in the synchronous workflow it is the orchestration service that "invokes" the replicas of the model and processor blocks, in the asynchronous workflow it is the replicas of blocks and services that procure work to do by themselves by reading the message queues (self-service mechanism).

Multiple replicas allow for a higher throughput, exactly like for synchronous workflows.

Like synchronous workflows, asynchronous workflows can also be parallelized by creating independent flows and creating branches with explicit or implicit Fork and Join operators, but asynchronous workflows can then have a further type of parallelism with Switch and contexts.
In fact, a Switch block can allow multiple downstream branches to be followed when multiple conditions are satisfied, and this happens in parallel, similarly to a Fork.

In a context, created with a splitter or a remapper, the part of the flow inside the context is executed with the maximum possible parallelism, based on the available replicas, to possibly process multiple items at the same time.
For example, if a PDF Splitter block produces 10 items, that is ten single-page PDFs, and in the context there is a TikaTesseract Converter block followed by a symbolic model, if these two blocks have a maximum of one replica each the workflow will only be able to process one item at a time, one after the other.
If instead the blocks have replicas, the items are distributed between them.
Continuing with the example, if the TikaTesseract Converter block has 2 replicas and the model block has 5, TikaTesseract Converter will be able to work on two PDF pages at the same time. When a replica produces the text of a page, any replica of the downstream model can work on that text, independently from the other pages. The 5 replicas of the model continuously poll the message queue of their block to grab page texts as they materialize, no matter in what order.
If TikaTesseract Converter is faster, on average, than the model, it makes sense to give more replicas to the model block because the page texts will be ready relatively quickly and the more replicas of the model can "consume" them the better for performance.
The end of the context represents a sync point: even if the model replicas quickly finish 9 pages, there can be one page text that takes longer, and the flow does not proceed downstream of the context until that one is finished as well.
In principle, however, the more blocks are replicated within a context, the shorter the context lasts for a given workflow request, because more of the items produced by the splitter or remapper can be processed at the same time.

In general, the rule for determining the number of replicas is the same as for synchronous workflows: give more replicas to the slower blocks.

Another important difference with synchronous workflows is that the degree of parallelism of blocks and services of an asynchronous workflow can change dynamically based on the load by enabling workflow autoscaling.
At rest, the number of replicas of a block or service can be low or even zero. Then, as load materializes, a number of replicas that is proportional to the load is turned on, increasing the parallelism and hence the throughput.
When the load goes down, replicas are removed.

Performance ceiling

The ceiling phenomenon can occur for an asynchronous workflow as well as for a synchronous one, but for anything beyond elementary workflows the asynchronous workflow has an advantage because the role of the orchestration service is reduced: the self-service capability of blocks and services reduces the coupling, and the performance ceiling tends to occur higher up.

By appropriately sizing the task management service, the workflow orchestration service, the replicas of blocks and services and, possibly, the autoscaling parameters based on the load, the phenomenon can be avoided altogether or moved much higher up.

Due to the stateful nature of asynchronous workflows, the path of balancing multiple runtimes to scale performance and add fault tolerance is less easy to follow than with synchronous workflows, but is still viable. The tricky part is configuring the affinity between the load balancer and the runtime, which is necessary for the correct management of tasks from the client perspective: the runtime that creates a task must be the same one that knows its status and can return its response.

Error management

Unlike synchronous workflows, asynchronous workflows can tolerate errors.

The first tolerance mechanism is established when publishing the workflow. At that moment it is possible to specify whether to retry after an error and the maximum number of such retries.

The second tolerance mechanism is available for splitter contexts and can be set as a functional property of the End Context block that closes the context. The context can be tolerant to a certain number or a certain percentage of failing iterations.

The third and last tolerance mechanism applies to the downstream branches of the Switch operator and is set as a functional property of the End Switch block on which the conditional branches converge. The tolerance is relative to the maximum number of branches that can fail.

The first error that exceeds the tolerance thresholds makes the workflow fail, causing the task to go in the ERROR state.

Timeouts

Unlike synchronous workflows, asynchronous workflows have a global timeout that is set when publishing the workflow.
If a task stays in RUNNING status more than the workflow timeout, the task is put in the DROPPED status.

Block timeouts apply, as for synchronous workflows, and they raise error conditions that may or may not cause the workflow to fail, depending on the configuration of the tolerance mechanisms.

Cost

In general, what is stated above about the cost of synchronous workflows is also true for asynchronous workflows. If workflow autoscaling is disabled, it's indeed the same: all the computers that are necessary to run all the replicas of every block and service are turned on immediately and kept on until the workflow is unpublished. This guarantees maximum readiness and performance, but is also the most expensive option and can lead to a waste of money and energy when the workflow is kept published but is not used.

Info

If a workflow can be synchronous or asynchronous, the asynchronous one will need more computing resources because of the shared services, that is Output Producer and, possibly, Script Interpreter and Proxy Legacy.

With workflow autoscaling enabled there are interesting differences because the number of computers turned on can vary based on the load and theoretically drop to zero, if not considering the computers that host the basic software or other workflows.
This makes costs less predictable, but surely gives the opportunity for savings.
The choice between enabling or not enabling autoscaling, and between the various autoscaling configurations, is driven by the performance you want to achieve. Speed has a price, so performance-oriented autoscaling settings will result in more computers turned on for longer, compared with an economy-oriented configuration that is however less responsive to load and has a lower average throughput.

Autoscaling

Overview

Autoscaling is a feature of asynchronous workflows and consists of dynamically increasing or decreasing the number of replicas of blocks and shared services as the load, that is processing requests, varies.

The user chooses whether or not to activate autoscaling when publishing the asynchronous workflow in a runtime.
The parameters that govern the autoscaling of replicas are chosen at that time and are the same for all blocks and shared services; however, each block or service can scale differently because it can have a different load.
The instantaneous load on a block or service depends, in fact, on several variables: the point in the flow, the activity and speed of the block or service and of the upstream blocks, and the quantity of requests that the workflow is managing at the same time. For example, a splitter can have a load of 1 while the blocks in its context have a load of n, with n equal to the number of items that the splitter produces. Also, a fast block has a lower load than a slow block because it clears its queue more quickly.

Autoscaling allows you to save money and energy because the computational footprint (CPU and RAM) of the workflow is reduced or even eliminated when there is no load. On the other hand, a workflow with autoscaling active cannot be as reactive as one without autoscaling, which keeps all the replicas of all blocks and services always active.

If autoscaling is enabled, the system periodically measures the load on the single block or shared service that can have replicas.
Based on the measured load, it checks whether there are conditions to scale the replicas up or down.

Load measurement

Every time interval corresponding to the Polling Interval parameter, the system measures the load of each block or shared service that can have replicas.

If Mode is equal to Queue Length, the load is the number of messages in the queue of the block or service.
If Mode is equal to Message Rate, the load is the average amount of messages entered into the queue each second during the previous polling interval.

Number of required replicas

Once the load is measured, the system determines the number of required replicas.

The number of replicas cannot be less than Min Replicas and cannot be greater than the maximum value set by the designer for the block or service.

The target value is calculated by dividing the load by Value:

Target = load / Value

Considering the minimum threshold and the maximum value, the complete formula is:

Required replicas = MAX(Min Replicas, MIN(maximum set by designer, target))

Need, direction and confirmation of scaling

Need and direction of scaling—up or down—are determined by this rule:

  • Required replicas = replicas currently turned on? → no scaling
  • Required replicas > replicas currently turned on? → scale up
  • Required replicas < replicas currently turned on? → scale down

In the special case of scaling "from zero", that is with no replicas already turned on, the system evaluates the Activation Value parameter and confirms the scaling only if the load is higher than the value of this parameter.
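Putting the formula and the rules together in a sketch (the document gives Target = load / Value; rounding the target up to a whole number of replicas is an assumption here):

```python
import math

def required_replicas(load, value, min_replicas, max_replicas):
    # Target = load / Value, rounded up to a whole replica (assumption)
    target = math.ceil(load / value)
    # Required replicas = MAX(Min Replicas, MIN(maximum set by designer, target))
    return max(min_replicas, min(max_replicas, target))

def scaling_decision(load, value, min_replicas, max_replicas,
                     current, activation_value):
    """Need and direction of scaling, including the scale-from-zero rule."""
    required = required_replicas(load, value, min_replicas, max_replicas)
    if current == 0 and load <= activation_value:
        return "no scaling"   # scale "from zero" only above Activation Value
    if required == current:
        return "no scaling"
    return "scale up" if required > current else "scale down"
```

For example, with Value 2 and a measured load of 9, the target is 5 replicas, clamped between Min Replicas and the designer's maximum.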

Scale amount

If the scaling is confirmed, the replicas to be turned on or deleted are those that are needed to reach the number of required replicas.

In case of upscaling, the number of replicas to be turned on is:

Required replicas - replicas currently turned on

For example:

  • Required replicas: 5
  • Replicas currently turned on: 2
  • Replicas to be turned on: 3

In case of downscaling, the number of replicas to be deleted is:

Replicas currently turned on - required replicas

For example:

  • Required replicas: 2
  • Replicas turned on: 5
  • Replicas to be deleted: 5 - 2 = 3

Performing the scaling

The system orders the upscaling immediately.
If there are replicas marked for deletion following a previous downscaling, the system unmarks a number of them equal to the smaller of the replicas to be turned on and the marked ones. For example:

  • If it has to turn on 3 replicas and finds 2 marked for deletion, it unmarks them both, because MIN(3, 2) is 2. It has only one replica left to turn on.
  • If it has to turn on 3 replicas and finds 6 marked, it unmarks 3, because MIN(3, 6) is 3. It does not need to turn on any more replicas.

The unmarked replicas were already turned on; they remain turned on and will no longer be deleted at the end of the cooldown period.

After that, this calculation is done:

Replicas to turn on - unmarked replicas

and if the result is greater than zero, the system proceeds to ask the cluster to turn on that number of replicas.
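The unmarking arithmetic can be sketched as follows (an illustrative model, not the actual scaler code):

```python
def replicas_to_start(to_turn_on: int, marked_for_deletion: int):
    """On upscale, first reuse replicas marked for deletion by a previous
    downscale, then ask the cluster only for what is still missing.

    Returns (replicas to actually start, replicas still marked)."""
    unmarked = min(to_turn_on, marked_for_deletion)  # MIN rule from the text
    still_marked = marked_for_deletion - unmarked
    to_start = to_turn_on - unmarked                 # ask cluster only for these
    return to_start, still_marked
```

This reproduces the two examples above: needing 3 with 2 marked leaves 1 to start; needing 3 with 6 marked leaves nothing to start and 3 still marked.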

Additional replicas are powered on immediately if there is space in the existing virtual computers. Otherwise, if the maximum number of computers set by the installation administrator has not been reached and the system is Cloud based, enough virtual computers to host the replicas that do not fit in the active ones are procured and started. In this case the time to power on the replicas can be in the order of a few minutes.
Scaling remains pending if:

  • Adding the necessary computers would lead to exceeding the maximum number of computers of the runtime set by the installer.
  • The Cloud provider does not have the type of computer requested available.

Downscaling actually occurs after a cooldown period determined by the Cooldown Period parameter. The replicas to be deleted are immediately marked for this fate, but if during the cooldown period the need to scale up is determined, one or more replicas marked for deletion (depending on the number of replicas to be powered on) are unmarked, thus reducing or avoiding the need to power on new replicas (see above).
Replicas that remain marked until the end of the cooldown period are effectively deleted.

For example, the system determines that 3 replicas need to be deleted.
All are immediately marked for deletion and the cooldown period countdown begins.
For six consecutive polling intervals there is no change: the replicas remain marked for deletion.
At the seventh polling interval from the start of the countdown it is determined that scaling up is necessary and 2 replicas need to be turned on: 2 replicas marked for deletion are then unmarked.
In the subsequent polling intervals, until the end of the cooldown period, the only replica left marked for deletion remains unchanged, so that replica is eventually deleted.
Deleting replicas frees up computational resources (CPU and RAM) in the virtual computers, and this can lead to some of them becoming empty. The cluster autoscaler then takes care of terminating the empty computers.