

This page provides a reference guide to autoscaling for Gradient Deployments.

Configure autoscaling to handle scale-up and scale-down events for your Gradient deployment.

How autoscaling works

Gradient autoscaling uses the Kubernetes Horizontal Pod Autoscaler. Some defaults have been chosen so that a deployment can scale up and down quickly.

Autoscaling scales the deployment up and down based on a chosen metric, a summary function, and a specified value. The number of running replicas for a deployment never falls below replicas or rises above maxReplicas.

To change the autoscaling configuration, update the spec through the CLI.

Autoscaling configuration

enabled (default: true): Turn autoscaling on or off.

maxReplicas: The upper bound on the number of replicas the deployment can run. The deployment's active replicas always stay between the value of replicas and the value of maxReplicas.

metric, summary, and value:

metric - Sets the metric used to scale up or down.

summary - Sets the function used to summarize the metric when calculating scale events.

value - The threshold for the summarized metric that causes the deployment to scale.

Autoscaling criteria

Multiple metrics can be used in the spec to determine when to scale. If you provide multiple metric blocks, the deployment calculates a proposed replica count for each metric and then scales to the highest of those counts.
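
The rule above can be sketched as follows. This is a minimal illustration, assuming Gradient applies the replica formula documented for the Kubernetes Horizontal Pod Autoscaler, ceil(currentReplicas * observed / target), to each metric block. The function names and the observed values are hypothetical, and details such as the autoscaler's tolerance band are left out.

```python
import math


def proposed_replicas(current_replicas, observed, target):
    """Replica count proposed by one metric (Kubernetes HPA formula)."""
    return math.ceil(current_replicas * observed / target)


def desired_replicas(current_replicas, observed, targets, min_replicas, max_replicas):
    """Take the highest per-metric proposal, then clamp it to the allowed range."""
    proposals = [
        proposed_replicas(current_replicas, observed[name], target)
        for name, target in targets.items()
    ]
    return max(min_replicas, min(max_replicas, max(proposals)))


# Targets correspond to the `value` of each metric block.
targets = {"requestDuration": 0.15, "cpu": 30, "memory": 45}
# Hypothetical observations: only cpu is above its target.
observed = {"requestDuration": 0.10, "cpu": 60, "memory": 45}

print(desired_replicas(2, observed, targets, min_replicas=1, max_replicas=5))  # prints 4
```

Here cpu proposes 4 replicas while the other two metrics propose 2, so the highest proposal wins. A proposal above maxReplicas would be capped at 5.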

The following metrics can be used:

| Metric | Summary | Description | Type |
| --- | --- | --- | --- |
| cpu | average | Average CPU utilization across each replica (% of 100) | Integer |
| memory | average | Average memory utilization across each replica (% of 100) | Integer |
| requestDuration | average | Average request duration over a 5 minute period across all IPs behind the proxy (seconds) | Integer |
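
A metric block can be checked against the metrics listed above before the spec is submitted. The sketch below is illustrative only: the helper name is hypothetical, it is not part of the Gradient CLI, and the requirement that value be a positive number is an assumption rather than a documented rule.

```python
# Metric and summary names taken from the list of supported metrics.
ALLOWED_METRICS = {"cpu", "memory", "requestDuration"}
ALLOWED_SUMMARIES = {"average"}


def validate_metric_block(block):
    """Return a list of problems found in one metric block (empty if none)."""
    problems = []
    if block.get("metric") not in ALLOWED_METRICS:
        problems.append(f"unknown metric: {block.get('metric')!r}")
    if block.get("summary") not in ALLOWED_SUMMARIES:
        problems.append(f"unknown summary: {block.get('summary')!r}")
    value = block.get("value")
    # Assumption: a usable threshold is a positive number (bool is excluded).
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    if not is_number or value <= 0:
        problems.append(f"value must be a positive number, got {value!r}")
    return problems


print(validate_metric_block({"metric": "cpu", "summary": "average", "value": 30}))
print(validate_metric_block({"metric": "gpu", "summary": "max", "value": 30}))
```

The first call reports no problems. The second reports both the unsupported metric and the unsupported summary.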

Autoscaling example

The following spec configures all metrics available for autoscaling. An example scenario follows the spec:

enabled: true
image: paperspace/deployment-fixture
port: 8888
resources:
  replicas: 1
  instanceType: C4
  autoscaling:
    enabled: true
    maxReplicas: 5
    metrics:
      - metric: requestDuration
        summary: average
        value: 0.15
      - metric: cpu
        summary: average
        value: 30
      - metric: memory
        summary: average
        value: 45

Example scenario: As requests begin to come through, the average request duration over a 5 minute period is greater than 150 ms. As a result, the deployment scales up from 1 to 2 replicas. Over the next 5 minute interval, the request duration is still longer than 150 ms, so the deployment scales to 3 replicas. After the 5 minute stabilization period, the deployment begins to scale down because request times have fallen below 150 ms.
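
The scenario can be traced with a small simulation. This is a rough sketch under stated assumptions: it uses the Kubernetes Horizontal Pod Autoscaler formula, ceil(currentReplicas * observed / target), treats each list entry as one 5 minute window, and models the stabilization period as holding the highest recommendation from the current and previous window before scaling down. The function name and the observed durations are hypothetical, and the sketch does not model how extra replicas would change the observed durations.

```python
import math


def simulate(durations, target, min_replicas, max_replicas, stabilization_windows=1):
    """Return the replica count after each 5 minute window."""
    replicas = min_replicas
    history = []  # clamped recommendation from every window so far
    trace = []
    for observed in durations:
        recommended = math.ceil(replicas * observed / target)
        recommended = max(min_replicas, min(max_replicas, recommended))
        history.append(recommended)
        if recommended > replicas:
            # Scale-ups are applied immediately.
            replicas = recommended
        else:
            # Scale-downs wait: use the highest recent recommendation.
            recent = history[-(stabilization_windows + 1):]
            replicas = min(replicas, max(recent))
        trace.append(replicas)
    return trace


# Two slow windows (200 ms) followed by two fast windows (90 ms), target 150 ms.
print(simulate([0.20, 0.20, 0.09, 0.09], target=0.15, min_replicas=1, max_replicas=5))
# prints [2, 3, 3, 2]
```

The trace shows the deployment growing from 1 to 2 and then 3 replicas, holding at 3 for one window after request times drop, and only then scaling down to 2.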